Optimization using multi-threaded parallel processing framework

ABSTRACT

Systems, methods, and instrumentalities are disclosed for encoder and/or decoder optimization using a multi-threaded parallel processing framework. An encoding and/or decoding device may receive a video sequence that includes a plurality of first-temporal level pictures associated with a first temporal level and a plurality of second-temporal level pictures associated with a second temporal level. The encoding and/or decoding device may allocate a first number of parallel processing threads for encoding and/or decoding the first-temporal level pictures and a second number of parallel processing threads for encoding and/or decoding the second-temporal level pictures. The device may perform this allocation based on temporal level priority, for example. The encoding and/or decoding device may encode and/or decode the first-temporal level pictures and the second-temporal level pictures. This encoding and/or decoding may be based on the allocation of the first number of parallel processing threads and the second number of parallel processing threads.

CROSS REFERENCE

This application claims the benefit of U.S. Provisional Application No. 62/061,648, filed on Oct. 8, 2014, which is incorporated herein by reference as if fully set forth.

BACKGROUND

Digital video services are rapidly expanding beyond fixed TV services over satellite, cable and terrestrial broadcasting to Internet-enabled mobile devices. Advances in resolution and computational capabilities of consumer devices (e.g., smart phones and tablets), expansion of video applications (e.g., video chat, mobile video recording, sharing and streaming) and an ever-increasing number of device and video consumers and producers have led to an increase in mobile video content generation and delivery. Consequently, demand has increased for video coding support for high resolutions (e.g., HD, full HD and UHD) in consumer devices.

Video coding systems may be used to compress digital video signals to reduce storage requirements and/or transmission bandwidth. Different types of video coding systems include block-based, wavelet-based, object-based and block-based hybrid video coding systems. Block-based video coding systems may be based on international video coding standards, such as the MPEG-1/2/4 part 2, H.264/MPEG-4 part 10 Advanced Video Coding (MPEG-4 AVC), VC-1 and High Efficiency Video Coding (HEVC)/H.265 standards. Some block-based video coding systems may have suboptimal coding and/or suboptimal operation. Improvements in operation (e.g., encoding and/or decoding speed) may result in suboptimal coding (e.g., loss of compression efficiency).

SUMMARY

Systems, methods, and instrumentalities are disclosed for encoder (e.g., HEVC encoder) and decoder (e.g., HEVC decoder) optimization using a multi-threaded parallel processing framework. Optimization may be implemented by a multi-level, multi-threaded parallel processing framework, which may be applied at picture and/or slice levels.

A video coding device (e.g., an encoding and/or decoding device) may include one or more processors that may allocate parallel processing threads amongst pictures based on their respective temporal level (TL) priority. For example, a video sequence that may include pictures associated with different temporal levels may be received.

The video sequence may include a plurality of first-temporal level pictures associated with a first temporal level and a plurality of second-temporal level pictures associated with a second temporal level. For example, the second-temporal level pictures may be non-reference pictures, and the first-temporal level pictures may be reference pictures. The video coding device may allocate, based on temporal level priority, a first number of parallel processing threads for encoding and/or decoding the first-temporal level pictures and a second number of parallel processing threads for encoding and/or decoding the second-temporal level pictures. For example, the first number of parallel processing threads may be larger than the second number of parallel processing threads, e.g., if the first temporal level is associated with a higher priority than the second temporal level. A portion of the first number of parallel processing threads may be allocated to a picture within the first-temporal level pictures based on the number of times other pictures reference that picture.

The first-temporal level pictures and the second-temporal level pictures may be encoded and/or decoded based on the allocation. For example, the first-temporal level pictures may be coded using the allocated first number of the parallel processing threads, and in parallel, the second-temporal level pictures may be coded using the allocated second number of the parallel processing threads.
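
As a minimal illustrative sketch (not the disclosed implementation), the following divides a fixed thread pool between two temporal levels in proportion to assumed priority weights; the 2:1 weighting and the helper name allocate_threads are assumptions for illustration only.

    # Sketch: split a thread pool between two temporal levels by priority.
    # The 2:1 priority weights are illustrative assumptions.
    def allocate_threads(total_threads, first_priority=2, second_priority=1):
        weight = first_priority + second_priority
        first = max(1, round(total_threads * first_priority / weight))
        second = total_threads - first
        return first, second

    # With 8 threads, the higher-priority (e.g., reference) level
    # receives more threads than the lower-priority level.
    print(allocate_threads(8))  # -> (5, 3)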

The parallel processing threads may be allocated such that the low-temporal level pictures of a group of pictures (GOP) may be coded in parallel with high-temporal level pictures of another GOP. For example, a first GOP and a second GOP may each include first-temporal level pictures and second-temporal level pictures. When the first-temporal level pictures of the first GOP have finished being coded, the threads allocated for coding the first-temporal level pictures may be used to code the first-temporal level pictures of the second GOP. The first-temporal level pictures of the second GOP and the second-temporal level pictures of the first GOP may be encoded or decoded in parallel using, e.g., their respective allocated parallel processing threads.

In an embodiment, the parallel processing threads may be allocated to pictures in a GOP. The threads may be allocated to code the first-temporal level pictures (e.g., pictures associated with higher priority) first, before being allocated to code the second-temporal level pictures (e.g., pictures associated with lower priority). When coding the second-temporal level pictures, the parallel processing threads may be allocated among the second-temporal level pictures in a round robin style. The threads may be evenly distributed among the second-temporal level pictures.

In an embodiment, the parallel processing threads may be allocated to pictures in different GOPs based at least in part on GOP priority. For example, the video coding device may allocate available parallel processing threads to a first GOP and a second GOP based on their respective GOP priority.

A video coding device may include a processor configured to receive a video sequence having multiple pictures. The device may determine the number of parallel processing threads to be allocated to code a picture based on whether the picture is referenced by other pictures in the video sequence. The number of parallel processing threads to be allocated to a picture may be determined based on the frame type of the picture (e.g., i-frame, b-frame, or p-frame). The number of parallel processing threads to be allocated to a picture may be determined based on the temporal hierarchy and/or coding complexity. The picture may be coded using the allocated threads.
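
A hypothetical per-picture budgeting helper is sketched below; the specific weights given to reference status, frame type, and temporal level are illustrative assumptions, not values from the disclosure.

    # Sketch: per-picture thread budget; all weights are assumptions.
    def threads_for_picture(is_referenced, frame_type, temporal_level,
                            max_threads=8):
        budget = max_threads
        if not is_referenced:
            budget //= 4                      # non-reference: fewer threads
        if frame_type == 'b':
            budget = max(1, budget // 2)      # assumed cheaper to code
        return max(1, budget - temporal_level)  # deeper levels get less

    print(threads_for_picture(True, 'i', 0))   # -> 8
    print(threads_for_picture(False, 'b', 3))  # -> 1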

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an example of a system in which High Efficiency Video Coding (HEVC) video coding technologies may be implemented.

FIG. 2 is an example of a block-based video encoder in which HEVC video coding technologies may be implemented.

FIG. 3 shows an example of a video playback system in which HEVC video coding technologies may be implemented.

FIG. 4 is an example of a block-based video decoder. The video decoder may be a single layer video decoder.

FIG. 5 shows an example of an implementation of HEVC with eight prediction unit (PU) modes for an inter-coded coding unit (CU).

FIG. 6 is an example of an HEVC implementation with prioritized, hierarchical processing of HEVC coding when a group of pictures (GOP) size is 8.

FIG. 7 shows an example architecture for a multi-level, multi-threaded parallel processing framework for HEVC encoding.

FIG. 8 shows an example of picture encoding order for a multi-threaded framework.

FIG. 9 shows an example of profiling results in terms of encoding time as a percent of overall encoding time for HEVC reference encoder HM-12.0.

FIG. 10 shows an example of an encoding order for an HEVC encoder with random access main configuration.

FIG. 11 shows an example of a modified encoding order for an HEVC encoder with random access main configuration.

FIG. 12A shows an example of context-based adaptive binary arithmetic coding (CABAC) dependency using wavefront parallel processing (WPP) methodology.

FIG. 12B shows an example of motion dependency using WPP methodology.

FIG. 13 shows an example of average encoding time per frame based on a number of WPP threads in an example encoder.

FIG. 14 shows an example of average encoding time per picture using a number of WPP threads in example encoders, comparing average encoding time for slice parallel encoding with WPP threads to slice parallel encoding with WPP threads combined with multi-thread processing of temporal level 3 (TL-3) pictures.

FIG. 15 shows an example of sequential processing of slice compression, deblocking and Sample Adaptive Offsets (SAO) processing.

FIG. 16 shows an example of parallel processing of slice compression, deblocking and SAO processing based on a coding tree unit (CTU) row.

FIG. 17 shows an example where a loop filtering (deblocking and SAO) process may start after CTU compression of the first 3 CTU rows is completed.

FIG. 18A shows an example of a reference picture encoding status.

FIG. 18B shows an example of a current picture encoding status.

FIG. 19 shows an example of average encoding time for TL-3 pictures for a number of WPP threads.

FIG. 20 shows an example of scheduling WPP threads for TL-0/1/2 pictures and scheduling new threads for next picture encoding.

FIG. 21 shows an example of scheduling WPP threads for TL-0/1/2 pictures and reusing the threads for next picture encoding.

FIG. 22 shows an example of a multi-GOP encoding architecture.

FIG. 23 shows an example of average encoding time per frame for different bit streams using different examples of HEVC optimization techniques.

FIG. 24 shows Rate-Distortion (RD) curves for an optimized HEVC encoder compared to an X265 encoder (v 1.0) for a ParkScene bit stream.

FIG. 25 shows Rate-Distortion (RD) curves for an optimized HEVC encoder compared to an X265 encoder (v 1.0) for a BasketballDrive bit stream.

FIG. 26A presents an example where threads are allocated in coding order.

FIG. 26B presents an example where threads are allocated based onpriority.

FIG. 27A presents an example where threads are migrated in coding order.

FIG. 27B presents an example where threads are migrated based onpriority.

FIG. 28A is a system diagram of an example communications system in which one or more disclosed techniques may be implemented.

FIG. 28B is a system diagram of an example wireless transmit/receive unit (WTRU) that may be used within the communications system illustrated in FIG. 28A.

FIG. 28C is a system diagram of an example radio access network and an example core network that may be used within the communications system illustrated in FIG. 28A.

FIG. 28D is a system diagram of another example radio access network and an example core network that may be used within the communications system illustrated in FIG. 28A.

FIG. 28E is a system diagram of another example radio access network and an example core network that may be used within the communications system illustrated in FIG. 28A.

DETAILED DESCRIPTION

A detailed description of illustrative embodiments will now be described with reference to the various Figures. Although this description provides a detailed example of possible implementations, it should be noted that the details are intended to be exemplary and in no way limit the scope of the application.

High Efficiency Video Coding (HEVC) may support parallel processing, for example, for multi- and many-core architectures. HEVC may support different picture partition strategies, such as slices, wavefront parallel processing (WPP) and tiles, for example, for high-level parallelism.

FIG. 1 shows an example system in which HEVC video coding technologies may be implemented. The system may include a Hypertext Transfer Protocol (HTTP) based video streaming system. A media capture device 102 may capture media (e.g., video, photograph, audio, etc.). In an example, as depicted in FIG. 1, the media capture device 102 may include a video camcorder, camera, voice recorder, smart phone, tablet, or any other device that has media capturing abilities.

The media captured from the media capture device 102 may be transferred to a media preparation device 104 for preparing the media. In some examples, the media collection device 102 and the media preparation device 104 may be independent devices. In other examples, however, the media collection device 102 may include the capability to collect the media as well as the ability to prepare the media. The media collection device 102 may also serve as the media preparation device 104. In an example, the media preparation device 104 may compress and chop the media into small segments, where a segment period is, for example, between two and ten seconds of video.

After the media is prepared by the media preparation device 104, the media may be transferred to a wireless transmit/receive unit (WTRU) 110 a, b, c, d, e, and/or f (collectively referred to as 110). In an example, the media may be transferred to a WTRU 110 from a media HTTP origin server 106. In some examples, HTTP cache devices 108 may be used to store previously transferred media to assist in the delivery of the media from the HTTP origin server 106 to the WTRU 110.

FIG. 2 is a block diagram of an example block-based video encoder 200 in which HEVC video coding technologies may be implemented. In an example, video encoder 200 may be a single layer video encoder. Video encoder 200 may be used, for example, to generate bit streams for a video streaming system, such as the video streaming system shown in FIG. 1. As shown in FIG. 2, the encoder 200 may employ techniques such as spatial prediction 202 (e.g., intra prediction) and temporal prediction (e.g., inter prediction and/or motion compensated prediction) to predict the input video signal and achieve efficient compression.

Mode decision logic 204 may select the most suitable form of prediction. Selection criteria may be based on a combination of rate and distortion considerations. In an example, the encoder may transform 206 and quantize 208 a prediction residual, where a prediction residual may be a difference signal between the input signal and the prediction signal. A quantized residual, together with the mode information (e.g., intra or inter prediction) and prediction information (e.g., motion vectors, reference picture indexes, intra prediction modes), may be further compressed by entropy coder 210, which may generate an output video bitstream 216.

As shown in FIG. 2, video encoder 200 may generate a reconstructed video signal by applying an inverse quantization 212 and inverse transform 214 to the quantized residual to obtain a reconstructed residual, which may be added to the prediction signal. The reconstructed video signal may be processed through a loop filter (LF) process 218 (e.g., deblocking filter (DB), Sample Adaptive Offsets (SAO) or Adaptive Loop Filters (ALF)). Reconstructed video blocks may be stored in reference picture store 220. Reconstructed video blocks may be used to predict a future video signal.

FIG. 3 shows a block diagram of an example video playback system in which HEVC video coding technologies may be implemented. The video playback system may include a receiver 302, decoder 304, and display (renderer) 306. Receiver 302 may receive the media. In some examples, the media received by the receiver 302 may include encoded media. The decoder 304 may receive the encoded media from the receiver 302 and may decode the encoded data. The decoder 304 may decode the encoded data and transmit the decoded data to the display 306 for viewing by the user.

FIG. 4 shows an example block-based video decoder. The video decoder may be a single layer video decoder. The video decoder may receive a video bitstream produced by an encoder, such as the encoder shown in FIG. 2. The video decoder may reconstruct the video signal, e.g., for display on a display device 412. Entropy decoder 402 may parse video bitstream 418. Residual coefficients may be inverse quantized 414 and inverse transformed 416 to obtain a reconstructed residual. Coding mode and prediction information, e.g., generated by an encoder, may be used to obtain a prediction signal, for example, using either spatial prediction 404 or temporal prediction 406. A reconstructed video signal may be generated, for example, by adding the prediction signal and the reconstructed residual. The reconstructed video signal may be processed through a loop filter (LF) process 408. The reconstructed video may be stored in a reference picture store 410. Reconstructed video may be displayed, via, for example, a display device 412, and/or used to decode a future video signal.

HEVC may provide block based hybrid video coding and decoding. An HEVC encoder and decoder may, for example, operate in accordance with the examples of an encoder and decoder in FIGS. 2 and 4. HEVC may allow use of larger video blocks. HEVC may use quadtree partition to signal block coding information. A picture or slice may be partitioned into coding tree units (CTUs), which may be the same size (e.g., 64×64). A CTU may be partitioned into coding units (CUs), e.g., using quadtree-based partition. A CU may be partitioned into prediction units (PUs) and transform units (TUs), e.g., using quadtree-based partition.
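
An illustrative sketch of such a quadtree partition is given below; the toy split rule and the 8×8 minimum CU size are assumptions for illustration, not the HEVC mode decision itself.

    # Sketch: toy recursive quadtree split of a 64x64 CTU into CUs; the
    # split rule and 8x8 minimum CU size are illustrative assumptions.
    def split_ctu(x, y, size, min_cu=8, should_split=lambda x, y, s: s > 32):
        if size > min_cu and should_split(x, y, size):
            half = size // 2
            cus = []
            for dy in (0, half):
                for dx in (0, half):
                    cus += split_ctu(x + dx, y + dy, half, min_cu, should_split)
            return cus
        return [(x, y, size)]  # leaf CU: top-left corner and size

    print(split_ctu(0, 0, 64))  # four 32x32 CUs under this toy split rule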

A PU associated with an inter coded CU may be, for example, one of eight (8) partition modes. In an example, FIG. 5 shows an example of an implementation of HEVC with eight prediction unit (PU) modes for an inter-coded coding unit (CU).

An encoding process may involve inter picture prediction. Samples for a block may be predicted, for example, based on selected motion data, such as a reference picture and a motion vector (MV).

An encoder and decoder may generate identical inter picture prediction signals, for example, by applying motion compensation (MC) using an identical MV and mode decision data. The MV and mode decision data may be transmitted in a bitstream received by a decoder. Linear filters may be applied to obtain pixel values at fractional positions. Application of linear filters may depend on the precision of motion vectors. As an example, precision may be a quarter pixel. Interpolation filters may have, as an example, 7 or 8 taps for luma and 4 taps for chroma. A residual signal of intra or inter picture prediction (e.g., a difference between an original block and its prediction) may be transformed by a linear spatial transform.

Transform coefficients may be scaled, quantized, entropy coded, and transmitted together with prediction information. An encoder may duplicate a decoder processing loop, for example, so that the encoder and decoder generate identical predictions for subsequent data. Quantized transform coefficients may be processed by inverse scaling and inverse transforms. A decoded approximation of the residual signal may be generated. The residual may be added to the prediction. Deblocking (DB) and SAO filters may operate on the result, e.g., to smooth out artifacts induced by block-wise processing and quantization. A final picture representation may be stored in a decoded picture buffer. Decoded pictures may be used in prediction of subsequent pictures.

A deblocking filter may be content based. Different deblocking filter operations may be applied at the TU and PU boundaries. Filter operations may depend on a number of factors, such as coding mode difference, motion difference, reference picture difference and pixel value difference.

Entropy coding may comprise context-based adaptive binary arithmetic coding (CABAC). CABAC may be applied, for example, to most block level syntax elements. CABAC may not be applied to high level parameters. By-pass coding may be a special case in CABAC coding. For example, equal probability may be used to code binary symbols 0 and 1.

Video codecs, such as H.264/MPEG-4 AVC, may be parallelized. Parallelism may be, for example, frame-level, slice-level or macroblock-level. Some approaches may have limited scalability, significant coding losses or large memory requirements.

HEVC coding tools, such as wavefront parallel processing (WPP) and tiles, for example, may facilitate high-level parallel processing. WPP and tiles may allow subdivision of a picture into partitions that may be processed in parallel. A partition may have an integer number of coding tree units (CTUs). CTUs for a partition may or may not have dependencies on CTUs for other partitions. Coding tools (e.g., WPP or tiles) may be enabled and disabled. A bitstream may indicate entry point offsets, for example, signaled in a slice header, indicating a start position for entropy decoding of a partition.

HEVC encoder implementations, e.g., X265, may use WPP technology. WPP may enable parallel processing within one slice. WPP may be extended as overlapped wave-front (OWF). OWF may enable picture level parallel encoding. OWF technology may be considered as overlapping encoding of consecutive pictures using wave-fronts. OWF may enhance the efficiency of WPP. OWF may mitigate inefficiency of WPP caused by CABAC context synchronization. As an example, processing may begin on a next picture instead of waiting to complete encoding for a current picture, for example, when a thread has finished a CTU row in the current picture and there are no additional CTU rows to process.

Encoding may be performed faster, for example, by utilizing multi-core capabilities of CPUs in combination with parallelization techniques. Optimization techniques may achieve fast HEVC encoding without sacrificing compression performance.

Real time encoding performance may be improved. Some encoders, for example, the HEVC reference software encoder or an HM encoder, optimize coding efficiency while maintaining a certain video quality. Video quality may be measured with objective quality metrics, such as peak signal to noise ratio (PSNR), video quality metric (VQM) and structural similarity index (SSIM). Video quality may be measured subjectively with human observers. Parallel processing approaches, such as Slices, WPP and Tiles, may improve encoding speed. However, an improvement in encoding speed may result in a loss of compression efficiency. For example, Slices and Tiles may break entropy encoding and prediction dependencies, which may prevent prediction across Slice or Tile boundaries.

WPP may result in a lower loss of compression efficiency compared to Slice and Tile parallel processing techniques. For example, WPP may be utilized to partition a picture into CTU rows while permitting prediction and entropy coding across CTU row boundaries.

A WPP technique may, for example, encode a CTU row by one thread. Multiple threads may operate in parallel. There may be a processing delay (e.g., a delay of two CTUs). A delay may be due to CABAC context synchronization between a current CTU row and its top CTU row. Delays may introduce parallelization inefficiencies. Inefficiencies may become more evident when a high number of WPP threads are used for encoding.

An OWF technique may have processing dependencies, e.g., caused by motion search. A future picture may be created using reconstructed samples of a reference picture for motion estimation. Reference pixels used for motion estimation may have already been processed by in-loop filters (e.g., deblocking, SAO). A CTU may not be ready for encoding until reference pixels within a motion search area in a reference picture have been encoded. This motion search dependency may limit the throughput of frame-level parallel processing threads encoding multiple frames.

As an example, assume there are three pictures P0, P1 and P2. P0 may be a reference picture for P1. P1 may be a reference picture for P2. Three threads T0, T1, T2 may be allocated to process the three pictures. A search window size may be SW×SH. A prediction unit (PU) size may be PW×PH. A PU of size PW×PH in P2 may have to wait for (SW+PW−1)×(SH+PH−1) pixels in P1 to complete encoding, and those (SW+PW−1)×(SH+PH−1) pixels in P1 may have to wait for (PW+2*SW−2)×(PH+2*SH−2) pixels in P0 to be encoded. As a result, thread T2 may be idle, waiting most of the time that threads T0 and T1 are working to finish encoding pictures P0 and P1. Delays may be amplified when the center of a search window, which may be determined by a motion predictor, is located toward an extremity of reference pictures P0 and/or P1. An extremity of a reference picture may take longer to reach, such as when threads T0 and T1 work from one extremity to another (e.g., top to bottom) of a picture.
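
For a concrete sense of scale, the following computes these dependency regions under assumed sizes (a 128×128 search window and a 64×64 PU); both sizes are assumptions for illustration.

    # Worked example of the dependency regions above; the 128x128 search
    # window and 64x64 PU sizes are illustrative assumptions.
    SW = SH = 128
    PW = PH = 64

    p1_pixels = (SW + PW - 1) * (SH + PH - 1)            # region in P1
    p0_pixels = (PW + 2 * SW - 2) * (PH + 2 * SH - 2)    # region in P0

    print(p1_pixels)  # 191 * 191 = 36481 pixels of P1
    print(p0_pixels)  # 318 * 318 = 101124 pixels of P0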

A challenge for encoder implementations, e.g., an HEVC encoder implementation, is to fully utilize available CPU computation resources for real-time encoding without compromising coding efficiency. CABAC dependency, among other dependencies, such as motion vector prediction between CTU rows, leads to idle frame processing threads waiting for threads processing reference pictures to finish.

Various optimization techniques may be used to reduce HEVC encoding time, e.g., on a platform with multi-core CPUs, without compromising compression efficiency, for example, compared to the compression efficiency of an HM reference encoder.

HEVC encoding process optimization may include, but is not limited to, instruction level optimization and process level optimization. Instruction level optimization may be implemented, for example, by Single Instruction Multiple Data (SIMD) instructions. Instruction level optimization may be applied, for example, to various time consuming modules in an encoder, such as motion compensation, distortion calculation in integer motion estimation by sum of absolute differences (SAD), distortion calculation measured by sum of absolute transformed differences (SATD) in fractional motion estimation (Hadamard transform), distortion calculation measured by sum of square errors (SSE) in rate distortion optimization (RDO) based mode decision, transform, inverse transform, quantization, de-quantization and reconstruction.

A multi-threaded (MT) parallel processing framework efficiently using CPU resources (e.g., multi-core resources) may use parallel processing techniques at the CTU level within a slice and at the picture level. Picture level parallel processing may depend, for example, on temporal levels considering a picture's referencing hierarchy. An example video coding device (e.g., an encoding device, a decoding device, etc.) may include a processor. In an example, the video coding device may receive a video sequence including first temporal level pictures and second temporal level pictures. The video coding device may allocate, based on temporal level priority, a first number of parallel processing threads for coding (e.g., encoding and/or decoding) the first temporal level pictures and a second number of parallel processing threads for coding the second temporal level pictures. The video coding device may code, e.g., the first temporal level pictures and the second temporal level pictures based on the allocation of the first number of parallel processing threads and the second number of parallel processing threads.

FIG. 6 is an example of an HEVC implementation with prioritized, hierarchical processing of HEVC coding when a group of pictures (GOP) size is 8. Pictures may be separated into different temporal levels. Pictures at a temporal level may refer to pictures at the same or a lower temporal level. FIG. 6, as an example, shows 9 pictures (e.g., picture order count (POC) 0, POC 1, POC 2, POC 3, POC 4, POC 5, POC 6, POC 7, and POC 8) being arranged according to 4 temporal levels (e.g., TL_0, TL_1, TL_2, TL_3). As depicted in FIG. 6, POC 0 and POC 8 may be the pictures having the lowest temporal level (and the highest priority), and POC 1, POC 3, POC 5, and POC 7 may be the pictures having the highest temporal level (and the lowest priority). Pictures at a lower temporal level may have a larger impact on whole sequence coding, for example, because more pictures may refer to (e.g., may depend on) pictures from the lower temporal levels. Because of their higher priorities, high impact pictures may be coded with smaller quantization parameters and may be allocated more bits. Further, in some examples, pictures at a highest temporal level may not be used as reference pictures, as is the case for the example shown in FIG. 6.

FIG. 7 shows an example architecture for a multi-level, multi-threaded parallel processing framework for HEVC encoding. FIG. 8 shows an example of picture encoding order for a multi-threaded framework. As depicted in FIG. 6, pictures having the non-lowest priorities (e.g., POCs 0, 2, 4, 6, and 8) may be grouped together, and pictures having the lowest priorities (e.g., POCs 1, 3, 5, and 7) may be grouped together. For illustration purposes, the pictures having the lowest priorities may be denoted as temporal level 3 (TL_3). For example, temporal level 3 pictures may include non-reference pictures. A picture that is not referenced by another picture may belong to temporal level 3.

FIG. 9 shows an example of profiling results in terms of encoding time as a percent of overall encoding time for HEVC reference encoder HM-12.0. Profiling results are shown for various modules in the encoding process, including interpolation, SAD (integer motion estimation), Hadamard transform (fractional motion estimation), SSE (distortion calculation using sum of squared errors) in RDO based mode decision, inverse transform, etc. Profiling is based on full HD (1920×1080) encoding carried out on an Intel quad core i7 CPU. Profiling results indicate that interpolation, Hadamard transform, SAD, SSE, and integer transform are the most time consuming encoding modules. One or more of these modules may be optimized using Streaming SIMD Extension (SSE) 4.1 instructions. SIMD instructions may use 128-bit registers, e.g., instead of 32-bit registers, for operations to improve encoding speed. In an example using a SIMD optimization technique, overall encoder speed improved by 46-50%.

A multi-level multi-threaded parallel processing framework may be applied at a picture level and/or a slice level. A multi-threaded parallel processing framework using WPP technology may be applied, for example, to process the CTU rows of a slice/picture in parallel. Pictures/frames may be categorized into multiple levels, such as two, three, four or the like.

Multi-threaded processing may be applied within one slice using WPP methodology. A parallelization scheme may exploit CTU-row level parallelism. A slice encoding thread manager may be responsible for creating a number of CTU-row compression threads (WPP threads). A thread may perform compression of a CTU row in a slice. WPP methodology may be used to enable an individual/independent entropy coder for a CTU row. With WPP enabled, a CABAC entropy coder for a current CTU row may be initialized, for example, based on the CABAC entropy coder's status after the first two CTUs of the CTU row above the current row are encoded.
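
A minimal sketch of this two-CTU synchronization is given below; the row and column counts and the use of Python threading primitives are illustrative assumptions rather than the disclosed implementation.

    # Sketch: one thread per CTU row; CTU (r, c) may start only after
    # CTU (r-1, c+1) above it finishes, modeling the two-CTU WPP lag.
    import threading

    W, H = 10, 4                              # CTUs per row, CTU rows (assumed)
    done = [[False] * W for _ in range(H)]
    cv = threading.Condition()

    def encode_row(r):
        for c in range(W):
            with cv:
                while r > 0 and c + 1 < W and not done[r - 1][c + 1]:
                    cv.wait()                 # top-right neighbor not done yet
            # ... CTU compression for (r, c) would run here ...
            with cv:
                done[r][c] = True
                cv.notify_all()

    rows = [threading.Thread(target=encode_row, args=(r,)) for r in range(H)]
    for t in rows: t.start()
    for t in rows: t.join()
    print("all CTU rows encoded")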

In an example, eight (8) CTU row encoding threads, which is equivalent to the number of hyper threads available in a quad core i7 CPU, may be used to encode eight (8) CTU rows of a slice in parallel. Note, however, that the use of 8 CTU rows is a non-limiting example and is used for illustrative purposes only. Further, an ideal encoding system has no dependency among CTU row encoding threads. In an example implementation of an encoder, there may be several types of delay, e.g., a CABAC initialization delay and a motion delay.

FIGS. 12A and 12B show, respectively, examples of CABAC dependency and motion dependency using WPP methodology. Encoding time of a frame is given by Eq. 1.

$T_{Anchor} = W \cdot H \cdot \Delta$  (Eq. 1)

In Eq. 1, Δ is the average time to encode a CTU, W is the number of CTUs in a CTU row and H is the number of CTU rows in the picture. Encoding time for a frame is given by Eq. 2, for example, when WPP is used with N WPP threads to encode a picture.

$T_{WPP} = \frac{W \cdot H \cdot \Delta}{N} + T_{CABAC\_Delay} + T_{Motion\_Delay}, \qquad T_{CABAC\_Delay} = \left[ 2 \cdot \Delta \cdot (N-1) \right] \cdot \frac{H}{N}$  (Eq. 2)

Motion delay of a CTU at the i-th row and the j-th column may be calculated according to Eq. 3, for example, when two WPP threads are used for compression.

$\Delta t_{i,j} = \mathrm{DELAY}\left( T_{i-1,j+1}, T_{i,j-1} \right), \qquad \mathrm{DELAY}(T_0, T_1) = \begin{cases} 0, & (T_0 - T_1) < 0 \\ T_0 - T_1, & (T_0 - T_1) \geq 0 \end{cases}$  (Eq. 3)

Motion delay of a CTU at the i-th row and j-th column may be calculated according to Eq. 4, for example, when N threads are used.

$\Delta t_{i,j} = \mathrm{DELAY}\left( T_{i-1,j+1} + \Delta t_{i-1,j+1}, \; T_{i,j-1} \right)$  (Eq. 4)

Motion delay for processing a whole picture may be calculated according to Eq. 5.

$T_{Motion\_Delay} = \left( \sum_{i=1}^{N} \sum_{j=1}^{W-2} \Delta t_{i,j} \right) \cdot \frac{H}{N}$  (Eq. 5)

Motion delay time may increase as the number of threads increases, for example, because a CTU compression may depend on CTU compression for a previous row.
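
A numeric illustration of Eqs. 1 and 2 under assumed values (a 1 ms average CTU time Δ, a 30×17 CTU grid, and T_Motion_Delay ignored) follows; all inputs are assumptions for illustration.

    # Sketch: theoretical per-frame WPP encoding time per Eqs. 1 and 2;
    # delta, W, and H are illustrative assumptions, motion delay ignored.
    delta, W, H = 1.0, 30, 17           # ms per CTU, CTUs per row, CTU rows

    def t_wpp(n):
        t_cabac = (2 * delta * (n - 1)) * (H / n)
        return (W * H * delta) / n + t_cabac    # + T_Motion_Delay (set to 0)

    for n in (1, 2, 4, 8):
        print(n, round(t_wpp(n), 1))    # 510.0, 272.0, 153.0, 93.5 ms
    # The CABAC delay term grows with n, one contributor to the
    # throughput saturation effect discussed with FIG. 13 below.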

FIG. 13 shows an example of the average encoding time per frame based on a number of WPP threads in an example encoder. Each bar represents measured encoding time in real time mode, and the corresponding curve represents theoretical encoding time without a motion delay time. In the example shown in FIG. 13, the test bitstream is a BasketballDrive sequence coded with quantization parameter (QP) 28. Parameters used in the example encoder comprise BasketballDrive 1920×1080 QP 28 on an Intel i7-2600 CPU with 4 cores (8 threads).

The curve in FIG. 13 shows a theoretical encoding time without motion delay, e.g., with T_Motion_Delay in Eq. 5 set to zero, and average CTU encoding time Δ estimated from encoding time without WPP. In other words, the curve describes the theoretical relationship between encoding time and number of threads if motion delay does not exist, while the bars account for motion delay. The bars in FIG. 13 show average encoding time, per picture, using a number of WPP threads measured using a WPP technique. As shown in FIG. 13, the rate of reduction in encoding time may substantially decrease with additional throughput (e.g., additional threads assigned to the same picture), for example, when more than four WPP threads are used for a compression process. This effect may be referred to as a “throughput saturation” effect.

FIG. 13 shows that the first three bars (1, 2, and 3 WPP threads) closely track the theoretical encoding time curve without motion delay. However, as depicted in FIG. 13, as the number of WPP threads increases beyond three, the rate of reduction in encoding time is slower than the theoretical rate of reduction without motion delay. This difference in the rate of decrease in encoding time may be observed by the increasing gap between the top of each respective bar and the curve as the number of WPP threads increases beyond three.

Further, motion dependency may reduce HEVC encoder throughput as the number of WPP encoding threads increases, for example, because CTU encoding may become unbalanced. Motion delay may be reduced by balancing CTU encoding, e.g., for the top left part of the slice. An encoder may apply a complexity control technique for a top left part of a slice. An encoder may terminate a motion estimation process early, for example, when a number of check points in motion estimation exceeds a threshold such as 3, 4, 5, or the like.

Parallel processing using a multi-threaded framework may be applied to a group of pictures (GOP) in multiple stages (e.g., two stages).

The number of threads for coding (e.g., encoding and/or decoding) the second-temporal level pictures may be zero before coding of the first-temporal level pictures is completed. The number of threads coding the first-temporal level pictures may become zero after completing the coding of the first-temporal level pictures. In an example, threads (e.g., parallel processing threads) may be evenly distributed among the second-temporal level pictures. In other examples, however, threads may be distributed unevenly among the first-temporal level pictures (e.g., based on picture priority).

For example, in a first stage, parallel processing may be applied for temporal level 0 (TL-0), temporal level 1 (TL-1), and temporal level 2 (TL-2) pictures. POCs 8, 4, 2, and 6 may be encoded in parallel. As an example, threads may be allocated to a picture within a temporal level based on the number of times other pictures reference the picture.

In a second stage, temporal level 3 (TL-3) pictures may be coded. For example, TL-3 pictures, such as POC 1, POC 3, POC 5 and POC 7, e.g., in a random access main configuration of HEVC encoding, may lack dependency among themselves but may have dependency on non-TL-3 pictures. For example, POC 8, POC 4, POC 2, POC 6, and POC 0 may be used as reference pictures to encode TL-3 pictures. Parallel processing may be applied for the four lowest priority pictures (POC 1, POC 3, POC 5, and POC 7). POC 1, 3, 5 and 7 may be encoded in parallel. WPP encoding threads may be allocated to TL-3 pictures evenly, for example, when there is no dependency among TL-3 pictures. The four TL-3 pictures may be processed in parallel with four picture level encoding threads. A picture thread (e.g., each of the four picture level encoding threads) may use WPP threads internally to encode a respective picture. In an example, original pixels may be used for motion estimation, e.g., to remove motion search dependency for temporal level 0, 1 and 2 pictures. In an example, reconstructed pixels may be used for motion estimation.
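
A simplified sketch of this two-stage schedule under assumed values (GOP size 8, a pool of 8 WPP threads) follows; stage 1 is shown sequentially for brevity, although the description above also contemplates parallel coding of the TL-0/1/2 pictures.

    # Sketch: code reference pictures first, then the four TL-3 pictures
    # in parallel with the WPP threads divided evenly among them.
    from concurrent.futures import ThreadPoolExecutor

    def encode_picture(poc, wpp_threads):
        # Placeholder for slice compression using `wpp_threads` WPP threads.
        return f"POC {poc} coded with {wpp_threads} WPP threads"

    TOTAL_WPP = 8
    stage1 = [8, 4, 2, 6]                 # TL-0/1/2 pictures, coding order
    stage2 = [1, 3, 5, 7]                 # TL-3 (non-reference) pictures

    for poc in stage1:                    # stage 1: full WPP budget each
        print(encode_picture(poc, TOTAL_WPP))

    per_pic = TOTAL_WPP // len(stage2)    # stage 2: share threads evenly
    with ThreadPoolExecutor(max_workers=len(stage2)) as pool:
        for msg in pool.map(lambda p: encode_picture(p, per_pic), stage2):
            print(msg)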

Picture level parallel processing may be applied to TL-3 pictures using a multi-threaded parallel processing framework, for example, after TL-0/1/2 picture encoding in a first stage completes. Loop filter and motion search region dependency limitations may be reduced or avoided when using an OWF technique.

An example of the multi-threaded framework is shown in FIG. 7, wherein picture-level thread 0 (e.g., TH-0) may be applied to POC 8 in a first stage and to POC 1 in a second stage; picture-level thread 1 (TH-1) may be applied to POC 4 in a first stage and to POC 3 in a second stage; picture-level thread 2 (TH-2) may be applied to POC 2 in a first stage and to POC 5 in a second stage; and picture-level thread 3 (TH-3) may be applied to POC 6 in a first stage and to POC 7 in a second stage. Picture-level threads (TH-0/1/2/3) may manage WPP threads. A WPP thread may process one CTU row. In a second stage, WPP threads may be distributed equally for the picture-level threads (TH-0/1/2/3). In a first stage, WPP threads may be distributed dynamically among picture-level threads. For example, N WPP threads may be assigned (e.g., successively assigned) to the first picture in coding order, POC-8, then N WPP threads may be assigned (e.g., successively assigned) to the second picture in coding order, POC-4, and/or according to picture coding order (e.g., as illustrated in FIG. 8 or FIG. 11). These assignments of WPP threads to non-TL_3 pictures may be done dynamically, based on the availability of WPP threads, and/or based on the clearing of dependencies for CTU rows.

FIG. 10 shows an example of an encoding order for an HEVC encoder with random access main configuration. The order shown in FIG. 10 may be used in HM reference software. FIG. 11 shows an example of a modified encoding order for an HEVC encoder with random access main configuration. FIG. 11 shows the pictures having the lowest priority (e.g., POC 3, POC 5, and POC 7), each found in TL-3, being ordered last.

A multi-threaded framework may create a temporal level picture encoding thread manager at a top level of encoding. A thread manager may be responsible for creating and managing threads to encode TL-3 pictures in parallel. In an example, a TL-3 encoding thread may be in charge of one TL-3 picture encoding process. The temporal level picture encoding thread manager may synchronize an encoding process between encoding of TL-0/1/2 pictures in a first stage and encoding of TL-3 pictures in a second stage.

In an example using multi-threaded encoding and a picture processed using one WPP thread, a 23-25% improvement was achieved in encoder speed in comparison to HEVC reference encoder HM-12.0.

Slice compression of a picture may be performed with a dynamic number of WPP threads. In an example, an i7 quad core processor may be used for encoder profiling in a multiple picture parallel encoding technique. A WPP encoding process of non-TL-3 pictures may be configured to use 8 WPP threads, and TL-3 pictures may be configured to use 2 WPP threads, so that 8 WPP threads (e.g., 4 TL-3 picture threads * 2 WPP threads/picture) are operated during the entire process of encoding the TL-3 pictures.

In an example, a 5-10% encoder performance improvement is achieved in comparison to a parallel encoding (WPP) technique using eight WPP threads for a slice/picture encoding. Performance improvement on an Intel i7 quad core processor may be limited due to the number of CPU cores.

FIG. 14 shows an example of the average encoding time per picture using a number of WPP threads in example encoders, comparing average encoding time for slice parallel encoding with WPP threads to slice parallel encoding with WPP threads combined with multi-thread processing of TL-3 pictures. Parameters used in the example encoder comprise BasketballDrive 1920×1080 QP 28 on an Intel i7-2600 CPU with 4 cores (8 threads). N may denote the number of WPP threads, as shown in FIG. 14. In an example, the combined technique uses N WPP threads for TL-0/1/2 picture encoding, 1 WPP thread per TL-3 picture when N is less than or equal to 4, and 2 WPP threads per TL-3 picture when N is more than 4. A combined technique may be faster regardless of the number of threads, but the margin of improvement may decrease as the number of threads increases.

Pictures of a temporal level may be coded (e.g., encoded and/or decoded) in parallel using a number of the parallel processing threads allocated for that temporal level. Parallel processing using a multi-threaded framework may be applied, for example, to pictures at lower temporal levels such as TL-0/1/2. There may be strong dependencies among TL-0, TL-1 and TL-2 pictures. In some examples, WPP threads may be allocated evenly among pictures of a temporal layer. In other examples, however, WPP threads may not be allocated evenly. Encoding threads may be allocated dynamically to improve encoding speed. For example, the threads may be allocated based on round robin scheduling, priority scheduling, lottery scheduling and/or others.

Multi-threaded processing of TL-0, 1 and 2 pictures (e.g., inter frame parallel processing) may be performed. For example, an OWF parallel processing technique may be used to encode pictures of a high priority temporal layer using multi-threaded parallel processing. A multi-threaded parallel processing framework's high priority temporal layer (e.g., non-TL-3) picture encoding thread manager may initiate encoding of a first picture. Other threads responsible for encoding other pictures may be waiting for completion of Sample Adaptive Offset (SAO) processing of coding tree units (CTUs) within motion search range in their respective reference pictures. For example, FIG. 18B shows an example of current picture encoding status while FIG. 18A shows reference picture encoding status.

A slice encoding thread manager may signal to the next picture encoding process that it may start encoding, for example, after the current picture (which may be a reference picture for the next picture) completes loop filter processing (e.g., deblocking and/or SAO) for CTUs within search range.

Sequential processing of slice compression, deblocking (DB) and SAO processing may be used. FIG. 15 shows an example of sequential processing of slice compression 1502, deblocking 1504 and SAO processing 1506. Sequential processing may make it more difficult to achieve faster encoding with an OWF parallel processing technique. The sequential encoding processing shown in FIG. 15 may be modified, for example, as shown in FIG. 16.

FIG. 16 shows an example of parallel processing of slice compression 1602, loop filtering (e.g., deblocking) 1604 and SAO processing 1606 based on a CTU row. Deblocking 1604 and SAO processing 1606 may be executed for a CTU row. CTU compression 1602, deblocking 1604 and SAO processing 1606 may be performed in parallel.

An SAO process 1606 may be performed for a CTU row, for example, after completion of the deblocking process for the CTU row. This process may be repeated for a picture. CTU encoding status may be updated to the next picture's CTU compression threads through the slice encoding thread manager, for example, after completion of a loop filtering process of a CTU row in a current picture. Synchronization overhead may be reduced on the next encoded picture's CTU compression threads.

Picture level parallel processing with thread management may be referred to as inter picture wave-front parallel processing (IPWPP) herein. An example of the IPWPP encoding process is shown in FIGS. 17, 18A and 18B. Referring to FIG. 17, a loop filtering (e.g., deblocking and SAO) process may start after CTU compression of the first 3 CTU rows is completed. Loop filtering may operate in parallel with CTU compression threads thereafter. An example of dependency between CTUs for a current picture and CTUs for a reference picture is shown in FIGS. 18A and 18B.

FIG. 18B shows an example of a current picture encoding status and FIG. 18A shows an example of a reference picture encoding status. In an example, a search window size may be 128×128. In FIG. 18B, CTU row X in the current picture may start compression, for example, when pixels within its search range in the reference picture (indicated as blocks marked by a dashed rectangle) have completed compression and in-loop filtering (ILF) shown in FIG. 18A.
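
A hypothetical helper expressing this dependency check is sketched below; the 128×128 search range and 64×64 CTU size are the assumed values used above, and the function name is illustrative.

    # Sketch: last reference CTU row whose in-loop filtering must complete
    # before CTU row X of the current picture may start compression.
    def last_reference_row_needed(x, search_range=128, ctu_size=64):
        extra_rows = -(-search_range // ctu_size)   # ceiling division
        return x + extra_rows

    # Example: current row 0 may start once reference rows 0..2 are
    # filtered, consistent with the three CTU row delay discussed below.
    print(last_reference_row_needed(0))  # -> 2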

IPWPP may be used, for example, in combination with other techniques, e.g., TL-3 WPP multi-threaded parallel processing, to encode multiple groups of pictures. FIG. 22 shows an example of a multi-GOP encoding architecture. As shown in FIG. 22, non-TL_3 pictures (e.g., first-temporal level pictures), including POCs 8, 4, 2, and 6, may belong to a same group (e.g., GOP-(n−1)) as TL_3 pictures (e.g., second-temporal level pictures), including POCs 1, 3, 5, and 7. In an example, the TL_3 pictures (e.g., POC 1, 3, 5, and 7) denoted within a first GOP (e.g., GOP-(n−1)) may be processed in parallel with non-TL_3 pictures (e.g., POC 16, 12, 10, and 14) denoted within a second GOP (e.g., GOP-n). In an example, the TL_3 pictures (e.g., POC 1, 3, 5, and 7) within the first GOP (e.g., GOP-(n−1)) may be processed in parallel with the non-TL_3 pictures (e.g., POC 16, 12, 10, and 14) of the second GOP (e.g., GOP-n), after encoding of at least a portion of the non-TL_3 pictures (e.g., POCs 8, 4, 2, and 6) of the first GOP (e.g., GOP-(n−1)) has completed.

IPWPP may use efficient signaling between WPP threads of current and next encoding pictures. Synchronization overhead between WPP threads may be minimized. WPP thread scheduling may be implemented to utilize maximum CPU capabilities. The effect of throughput saturation, which may be due to CABAC and motion vector prediction dependencies, may be reduced. Efficient thread scheduling may be utilized to achieve maximum CPU capabilities for TL-0/1/2/3 picture encoding.

Free WPP threads from a current picture may be assigned to a next picture encoding, for example, after allocating the current picture CTU rows for compression. WPP threads may be scheduled for next picture encoding. For example, whether to allocate threads for coding a next picture may be determined based on the coding complexity of the current picture, the coding complexity of the next picture, and/or the effect of throughput saturation.

Thread management for parallel encoding of pictures within a GOP may be performed. The example results shown in FIGS. 13 and 14 indicate that the reduction in encoding time from increasing throughput may saturate. For example, an implementation using more than six WPP threads may not significantly improve encoding speed. Intel i7 quad core CPU capabilities may not be fully exploited, for example, for one or more reasons (e.g., CABAC and motion dependency). CTU encoding threads may be scheduled, for example, so that a light weight thread starts deblocking and SAO processing for the current picture after completing a compression process for two CTU rows. There may be dependency between compression and deblocking, e.g., as shown in FIG. 17. The actual number of CTU rows affected by motion dependency may depend, for example, on the motion search range and the CTU block size. For example, assume the search range is 128×128 and the CTU is 64×64. Motion dependency may cause delays of two CTU rows.

In an example, a next picture encoding process may start after the current picture completes loop filtering of three CTU rows. An additional CTU row delay may occur. A compression process for a current picture CTU row X may start, for example, after deblocking and SAO of reference picture CTU rows X and (X+1) are completed. Reconstructed pixels from row (X+2) in the reference picture may be needed to complete deblocking and SAO of row (X+1) in the reference picture. Accordingly, there may be a three CTU row delay between the current picture and the reference picture, based on the assumption of a 128×128 search range and a 64×64 CTU size.

One or more WPP threads from a current picture compression may be allocated or assigned to process the next picture compression. The number of threads scheduled for next picture processing may depend, for example, on the next picture encoding complexity. The TL level may indicate complexity. For example, a lower TL (e.g., TL-0 or TL-1) picture encoding complexity may be higher than the encoding complexity of a higher TL (e.g., TL-2) picture.

The number of threads to be allocated to a next picture may be determined based on the maximal number of WPP threads used for coding a picture, the estimated encoding time of the current picture, and/or the estimated encoding time of the next picture. The number of WPP threads considered for thread switching from the current picture to the next picture may be estimated in accordance with Eq. 6, for example, after estimating current and next picture encoding complexity.

$N_{Switch\_Threads} = \max\left( 1, \mathrm{Int}\left( \gamma \cdot N_{Maximal\_Threads} \right) \right)$  (Eq. 6)

In Eq. 6, $N_{Switch\_Threads}$ is the number of WPP threads switched to next picture encoding, $N_{Maximal\_Threads}$ is the maximal number of WPP threads used for one picture encoding and γ is the thread switching factor. The thread switching factor γ may be calculated based on the relative encoding complexity of the current picture and the next picture, for example, in accordance with Eq. 7,

$\gamma = \begin{cases} 0.10 & \text{if } \frac{T_{Next}}{T_{Curr}} \leq 0.6 \\ 0.25 & \text{if } 0.6 < \frac{T_{Next}}{T_{Curr}} \leq 0.8 \\ 0.40 & \text{if } 0.8 < \frac{T_{Next}}{T_{Curr}} \end{cases}$  (Eq. 7)

In Eq. 7, $T_{Curr}$ is the estimated encoding time of the current picture and $T_{Next}$ is the estimated encoding time of the next picture.

A CTU encoding thread switch may happen after $W_{CompressedCTURows}$ CTU rows in the current picture are completed, for example, in accordance with Eq. 8.

$W_{CompressedCTURows} = W_{MaxCTURows} \cdot \lambda$  (Eq. 8)

In Eq. 8, the λ factor can be chosen, for example, in accordance with the limitation in Eq. 9.

$0.4 \leq \lambda \leq \max\left( 0.4, \frac{W_{MaxCTURows} - \left( N_{Maximal\_Threads} \cdot \frac{T_{Next}}{T_{Curr}} \right)}{W_{MaxCTURows}} \right)$  (Eq. 9)
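
A numeric illustration of Eqs. 6-9 under assumed values (8 maximal WPP threads, 17 CTU rows, and a next picture estimated at 70% of the current picture's encoding time) follows; all inputs are assumptions for illustration.

    # Sketch: thread switching per Eqs. 6-9; all inputs are assumptions.
    t_ratio = 0.7                       # T_Next / T_Curr (assumed)
    n_max, w_max_rows = 8, 17           # maximal threads, CTU rows (assumed)

    gamma = 0.10 if t_ratio <= 0.6 else 0.25 if t_ratio <= 0.8 else 0.40  # Eq. 7
    n_switch = max(1, int(gamma * n_max))                                 # Eq. 6
    lam_upper = max(0.4, (w_max_rows - n_max * t_ratio) / w_max_rows)     # Eq. 9

    print(gamma, n_switch)              # 0.25, 2 threads switch to next picture
    print(0.4, round(lam_upper, 2))     # lambda may be chosen in [0.4, 0.67]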

WPP threads may be shared (e.g., shared equally) among multiple TL-3 pictures, for example, after completing temporal level 0, 1 and 2 pictures. A picture may use, for example, two WPP threads for slice compression.

WPP threads may be scheduled for TL-3 picture encoding to improve encoding performance. FIG. 19 shows an example of average encoding time for TL-3 pictures for a number of WPP threads. One or more additional WPP threads may be assigned to a TL-3 picture when more CPU cores are available. FIG. 19 compares performance of an optimized HEVC encoder using 2, 4 and 8 WPP threads for TL-3 picture encoding. For example, as shown in FIG. 19, a 1.614× performance improvement may be achieved using four WPP threads for a TL-3 picture relative to using two WPP threads. As shown, WPP threads may be scheduled for TL-0/1/2 picture encoding to improve encoding performance.

Inter picture wavefront parallel processing may be used to code TL-0/1/2 pictures in parallel. WPP scheduling may be accomplished by a variety of different techniques. FIG. 20 shows an example of scheduling WPP threads for TL-0/1/2 pictures and scheduling new threads for next picture encoding. A slice compression process may start using a pre-selected number of WPP threads (e.g., 8 WPP threads). The number of threads may be less than a maximum number of available threads.

As shown in FIG. 20, a next picture slice compression may be started with a new WPP thread (WPP TH 9), for example, after SAO processing of three CTU rows. A new WPP thread, for example WPP TH 10, may be initiated for processing the next available CTU row in the next picture, and the current WPP thread may start processing the next available CTU row compression in the current picture, for example, as soon as a WPP thread completes the processing of one CTU row. Limiting the number of WPP threads used for encoding of a picture (e.g., 8 WPP threads in FIG. 20) to less than the maximum number of available threads may permit scheduling more threads for processing of a next picture while current picture encoding is in progress.

FIG. 21 shows an example of scheduling WPP threads for TL-0/1/2 pictures and reusing the threads for next picture encoding. Compared to the example shown in FIG. 20, the example shown in FIG. 21 may focus more on encoding a current picture.

A slice/picture compression process may start, for example, by using a maximum number of WPP threads. For example, a maximum number of WPP threads may be calculated as the minimum of the number of available CPU cores and the number of CTU rows in a slice. Next picture slice compression may be started using the existing WPP thread (WPP TH 0), for example, after SAO processing of three CTU rows. A WPP thread (e.g., WPP TH 3) that completes processing of one CTU row may, for example, be reused for processing the next available CTU row in the current picture, or the next available CTU row in the next picture, for example, when each CTU row in the current picture has either completed encoding or is being processed by a thread.

An encoder may determine the dependencies in a GOP, for example, by pre-analysis. Pre-analysis may be implemented, for example, by a look-ahead or based on pre-set configurations.

FIGS. 26A and 26B are flow charts showing examples of thread scheduling for multiple picture parallel encoding. FIG. 26A presents an example where threads may be allocated in coding order. FIG. 26B presents an example where threads are allocated based on priority.

Several example variables in FIGS. 26A, 26B, 27A and 27B may be defined. For example, Pic(i) may be defined as the i-th picture in the current group of pictures encoding in parallel. NT_max(P) may be defined as a maximum number of threads for one picture P. NT(Pic) may be defined as a number of threads working on picture Pic. A pool of WPP threads {T_k | 0 ≤ k ≤ MT} may be given.

In an example, assume that threads are initially in the thread pool and no thread is yet allocated to any picture. Threads may be progressively assigned to pictures in a GOP. Assignment may begin with the first coding picture Pic(i) and proceed on to following pictures. A starting number of coding pictures may be equal to the size of the GOP, among which threads may be allocated.

As shown in FIG. 26A, a thread from the thread pool may be allocated in encoding order. At 2600 of FIG. 26A, a thread (T) may be allocated from a thread pool. At 2602, it may be determined whether the number of threads for the i-th picture (e.g., the i-th picture in the current group of pictures encoding in parallel) is greater than, or equal to, the maximum number of threads for the i-th picture. If the number of threads for the i-th picture is greater than or equal to the maximum number of threads for the i-th picture, the example method may move to 2604. If the number of threads for the i-th picture is not greater than or equal to the maximum number of threads for the i-th picture, the example method may move to 2614.

At 2614, it may be determined whether there are any uncoded CTU row(s) (e.g., at least one uncoded CTU row) in the i-th picture of the current group of pictures, and whether a dependency has been cleared. If there is not at least one uncoded CTU row in the i-th picture of the current group of pictures, and/or if the dependency is not cleared, the thread may be returned to the thread pool at 2616. If there are any uncoded CTU row(s) in the i-th picture of the current group of pictures, and if the dependency is cleared, the example method may move to 2618. At 2618, the thread may be assigned to the i-th picture in the current group of pictures, and the number of threads working on the i-th picture in the current group of pictures may be incremented by 1.

As noted, at 2602, it may be determined whether the number of threads for the i-th picture is greater than, or equal to, the maximum number of threads for the i-th picture. If so, the example method may move to 2604. At 2604, the example method may increment to the next picture (e.g., the (i+1)-th picture) in the current group of pictures encoding in parallel. A check is performed, at 2606, to determine whether the number of threads for the next picture (e.g., the (i+1)-th picture) is less than the maximum number of threads for the next picture. As shown in FIG. 26A, at 2606, it may also be determined whether the dependency has been cleared. If both conditions of 2606 are satisfied (e.g., the number of threads for the next picture is less than the maximum number of threads for the next picture, and the dependency has been cleared), the example method moves to 2608. At 2608, the thread is assigned to the next picture (e.g., the (i+1)-th picture), and the number of threads for the next picture is incremented. If one, or both, of the conditions of 2606 are not satisfied, the example method moves to 2612.

At 2612, whether the remaining pictures have been checked may be determined. If the remaining pictures have been checked, the thread may be returned to the thread pool, at 2616. If the remaining pictures have not been checked, the number of the picture is incremented, at 2610, and 2606 may be performed, as described herein.
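
Condensed into a single loop over pictures in coding order (a simplification of the flow chart, which distinguishes the i-th picture from subsequent pictures), the FIG. 26A logic might be sketched as follows; PictureState and GopState are the illustrative types sketched earlier, and dependencyCleared() stands in for the dependency checks discussed below:

    bool dependencyCleared(const PictureState& p); // assumed predicate

    // Offer one pool thread to the first picture, in coding order, that is
    // below its cap (2602/2606), has an uncoded CTU row (2614), and has its
    // dependencies cleared. Returns the picture index, or -1 when the
    // thread goes back to the pool (2616).
    int scheduleInCodingOrder(GopState& gop) {
        const int n = static_cast<int>(gop.pics.size());
        for (int i = 0; i < n; ++i) {
            PictureState& p = gop.pics[i];
            if (p.nt >= p.ntMax) continue;       // at the cap: try the next picture
            if (p.uncodedCtuRows == 0) continue; // nothing left to encode
            if (!dependencyCleared(p)) continue; // not ready yet
            ++p.nt;                              // 2618/2608: assign the thread
            --p.uncodedCtuRows;
            return i;
        }
        return -1;                               // 2616: return to the pool
    }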

In an example shown in FIG. 26B, a thread from the thread pool may be allocated based on priority between reference pictures and non-reference pictures. For example, a thread may be allocated to reference pictures within the set of coding pictures because more pictures depend on reference pictures, which may encourage prioritizing the completion of encoding of the reference pictures first. A thread may be allocated to non-reference pictures in encoding order, for example, when there is no available reference picture for thread assignment. In an example shown in FIG. 26B, reference pictures and non-reference pictures may be distinguished by the thread scheduling process, with reference pictures being given priority over non-reference pictures.

As shown in FIG. 26B, threads may be allocated based on priority (e.g., reference pictures having a higher priority than non-reference pictures). Many processes shown in FIG. 26A, and described herein, are also shown in FIG. 26B. Where like processes are shown in FIG. 26A and FIG. 26B, like element numbers are used. For example, 2600, 2602, 2604, 2608, 2610, 2612, 2614, 2616, and 2618 are shown in both FIG. 26B and FIG. 26A to denote similar processes. FIG. 26B differs, in part, from FIG. 26A, in that the example method shown in FIG. 26B may discriminate thread allocation based on whether a picture is a reference picture or a non-reference picture.

FIG. 26B may be similar to FIG. 26A with respect to 2600 through 2604. From 2604, FIG. 26B depicts the example method performing 2650, whereas FIG. 26A depicts the example method performing 2606. Step 2650 of FIG. 26B may differ from 2606 of FIG. 26A, in part, in that 2650 of FIG. 26B determines whether the next picture (e.g., the (i+1)-th picture) is a reference picture. If the next picture is a reference picture, 2650 of FIG. 26B may be equivalent to 2606 of FIG. 26A, and the remaining steps 2608, 2610, 2612, 2614, 2616, and 2618 may be common to both the example method shown in FIG. 26B and the example method shown in FIG. 26A.

As shown in FIG. 26B, at 2650, if the next picture (e.g., the (i+1)-th picture) is not a reference picture, then at 2612, whether the remaining pictures have been checked may be determined. If the remaining pictures have not been checked, the example method may move to 2610, as described herein. If it is determined that the remaining pictures have been checked, the picture is incremented at 2652. At 2654, it may be determined whether the incremented picture is a non-reference picture, whether the number of threads for the incremented picture is less than the maximum number of threads permitted for the incremented picture, and whether the dependency has been cleared. If the conditions checked at 2654 are answered in the affirmative, the thread may be assigned to the incremented picture, and the number of threads for the incremented picture may be incremented by 1, at 2656.

If any one of the checks is answered in the negative at 2654 (e.g., if the incremented picture is a reference picture, or if the number of threads for the incremented picture is not less than the maximum number of threads permitted for the incremented picture, or if the dependency has not been cleared), the example method may determine whether the remaining pictures have been checked, at 2658. If the remaining pictures have been checked, the thread may be returned to the thread pool, at 2616. If the remaining pictures have not been checked, the picture may be incremented, and 2654 may be performed, as described herein.
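
Under the same illustrative types, the FIG. 26B variant might be sketched as the coding-order search run twice, first over reference pictures and then over non-reference pictures, so a reference picture is always offered the thread first (this reuses dependencyCleared() from the previous sketch):

    // Priority scheduling sketch: pass 0 considers only reference pictures
    // (2650/2606), pass 1 only non-reference pictures (2654/2656). Returns
    // the chosen picture index, or -1 when the thread returns to the pool.
    int scheduleByPriority(GopState& gop) {
        const int n = static_cast<int>(gop.pics.size());
        for (int pass = 0; pass < 2; ++pass) {
            const bool wantReference = (pass == 0);
            for (int i = 0; i < n; ++i) {
                PictureState& p = gop.pics[i];
                if (p.isReference != wantReference) continue;
                if (p.nt >= p.ntMax) continue;
                if (p.uncodedCtuRows == 0) continue;
                if (!dependencyCleared(p)) continue;
                ++p.nt;                            // 2608/2656: assign
                --p.uncodedCtuRows;
                return i;
            }
        }
        return -1;                                 // 2616: back to the pool
    }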

The order in which pictures of the GOP may be checked (e.g., the numbering order of pictures as enumerated by Pic(i)) may be any order. In an example, pictures may be checked in encoding order. In an example, pictures may be checked in a predetermined priority order. Index ‘i’ may be initially set to the value of the first picture of the GOP in the checking order, for example, when a thread scheduling process (e.g., as shown in the examples in FIGS. 26A, 26B, 27A, 27B) is invoked. A thread scheduling process may check pictures progressively in checking order, for example, until a current thread is assigned to an encoding task in one of the pictures. Index ‘i’ may be set to the lowest value corresponding to a picture for which unscheduled encoding tasks are still remaining, for example, when it is known that some leading subset of pictures in the checking order have already been completely encoded or have already had threads assigned to relevant encoding tasks for those pictures. A thread scheduler may track which pictures still have unscheduled encoding tasks, and may check those pictures, for example, when the thread scheduler process is called. For example, a thread scheduler may avoid checking pictures that do not have unscheduled encoding tasks remaining.
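
A minimal sketch of that bookkeeping (an ordered set of picture indices with pending work; the structure and names are assumptions for illustration):

    #include <set>

    // Tracks which pictures still have unscheduled encoding tasks so the
    // scheduler can skip completed pictures and start at the first pending
    // one in checking order.
    struct TaskTracker {
        std::set<int> pending;  // picture indices, kept in checking order

        void markAllPending(int gopSize) {
            for (int i = 0; i < gopSize; ++i) pending.insert(i);
        }
        void markDone(int pic) { pending.erase(pic); }

        // Initial value of index 'i' for a scheduling pass, or -1 if none.
        int firstPending() const {
            return pending.empty() ? -1 : *pending.begin();
        }
    };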

A thread scheduling process, examples of which are depicted in FIGS. 26A, 26B, 27A and 27B, may check whether one or multiple dependency constraints have been cleared for a picture, for a portion of a picture, for a CTU row, etc. Checking dependency constraints may determine whether a picture, a portion of a picture, a CTU row, etc. may be ready to be coded (e.g., encoded and/or decoded). Being ready for encoding may mean, for example, that no outstanding dependency constraints would prevent the start of encoding, so that the thread scheduler may assign a thread to the encoding task related to the picture, the portion of the picture, the CTU row, etc. Dependency checking may include consideration of the search range (e.g., motion vector search window size) for reference pictures of both reference lists for a CTU row. For example, dependency checking may check that the pixels of any reference pictures on which the motion vector searching for the CTU row depends are already available. Dependency checking may include consideration of the dependency of a WPP thread on the above CTU row. For example, a dependency check may check that the above CTU row is already encoded, or that enough of the above CTU row is already encoded to allow a WPP thread to begin encoding the current CTU row. Dependency checking may consider both search range and CTU row dependency. Dependency checking may consider any other dependency relationship that may prevent the start of encoding for a CTU row.
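
An illustrative test combining the two checks named above (search range and above-row WPP dependency); every field and threshold here is an assumption made for the sketch, not a value from the source:

    // State consulted when deciding whether one CTU row may start encoding.
    struct CtuRowDep {
        int rowIndex;              // index of this CTU row in its picture
        int searchRangeRows;       // vertical motion search range, in CTU rows
        int refRowsReconstructed;  // rows reconstructed in the slowest reference
        int aboveRowCodedCtus;     // CTUs already encoded in the row above
        int wppLeadCtus;           // required lead on the above row (often 2)
        bool firstRow;             // row 0 has no above-row dependency
    };

    bool ctuRowReady(const CtuRowDep& d) {
        // (a) Search range: rows 0 .. rowIndex + searchRangeRows of each
        //     reference picture must already be reconstructed.
        bool refReady =
            d.refRowsReconstructed >= d.rowIndex + d.searchRangeRows + 1;
        // (b) WPP: the above CTU row is far enough ahead (or does not exist).
        bool aboveReady = d.firstRow || d.aboveRowCodedCtus >= d.wppLeadCtus;
        return refReady && aboveReady;
    }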

A thread scheduling algorithm may be invoked, e.g., periodically, by a thread management process. A thread scheduling algorithm may be triggered, for example, when a thread completes a previously assigned task (e.g., completes encoding of an assigned WPP row) and/or when a thread is returned to the thread pool. A trigger may be active, for example, when thread re-allocation is needed.

A thread migration algorithm may implement a thread migration technique. FIGS. 27A and 27B are flow charts showing examples of thread migration for multiple picture parallel encoding. FIG. 27A presents an example where threads are migrated in coding order. FIG. 27B presents an example where threads are migrated based on priority.

Threads from a thread pool may be allocated to a first picture in coding order. A thread migration algorithm, e.g., as shown in FIG. 27A or 27B, may be invoked when a thread is done (e.g., when that thread has completed its task, for example, when a WPP thread has completed encoding of a CTU row). An algorithm may migrate a thread in coding order, e.g., as shown in FIG. 27A. An algorithm may migrate a thread based on a picture's priority (e.g., reference pictures prioritized over non-reference pictures), e.g., as shown in FIG. 27B. Thread migration techniques may gradually allocate threads from a thread pool among pictures.

Thread waiting time, e.g., due to dependency of WPP threads in one picture, may be reduced, for example, by allocating threads evenly to pictures that have the same priority. Pictures having the same priority may be, for example, non-reference pictures or pictures of a particular temporal level (e.g., TL-3 pictures). Threads may be scheduled or migrated among pictures having the same priority, for example, in round-robin fashion.
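
One way to realize the round-robin behavior is a rotating start index over the same-priority pictures, sketched below; ready(i) stands in for the per-picture eligibility checks from the earlier sketches, and all names are illustrative:

    #include <functional>
    #include <vector>

    // Round-robin pick among same-priority pictures: rotate the starting
    // index on every call so threads spread evenly instead of piling onto
    // the first eligible picture. Returns -1 when no picture is eligible.
    int roundRobinPick(const std::vector<int>& samePriorityPics,
                       const std::function<bool(int)>& ready,
                       int& nextStart) {
        const int n = static_cast<int>(samePriorityPics.size());
        for (int k = 0; k < n; ++k) {
            const int slot = (nextStart + k) % n;
            if (ready(samePriorityPics[slot])) {
                nextStart = (slot + 1) % n;  // next call starts after this pick
                return samePriorityPics[slot];
            }
        }
        return -1;
    }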

A maximum number of threads for a picture, NT_max(P), may be set by various techniques. An example technique may set the maximum number of threads based on whether a picture is a reference picture or a non-reference picture. For example, reference pictures may be assigned a maximum of four threads, while non-reference pictures may be assigned a maximum of two threads (e.g., equal to the number of available cores divided by the number of non-reference pictures). Reference pictures may be assigned a higher maximum number of threads NT_max(P) than non-reference pictures.

An example technique may set a threshold based on the temporal level a picture belongs to according to the hierarchical coding structure. For example, pictures of a first temporal level may be assigned a maximum of six threads, pictures of a second temporal level may be assigned a maximum of four threads, and pictures of a third temporal level may be assigned a maximum of two threads. In an example, maximum thread assignments may refer to a maximum number of WPP threads assigned to a picture. Pictures of a higher temporal level may be assigned a lower maximum number of threads NT_max(P) than those of a lower temporal level.

An example technique may set a threshold according to a picture's position in a GOP structure. For example, a first picture in a GOP may be assigned a maximum of six threads and a second picture in a GOP may be assigned a maximum of four threads. Pictures that may be expected to serve as reference pictures for a greater number of subsequent pictures may be assigned a larger maximum number of threads NT_max(P).
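
A possible NT_max(P) rule keyed to temporal level, using the example figures above (six/four/two); the exact mapping of "first/second/third temporal level" to TL indices is an assumption, and any other cap that does not increase with temporal level would fit the same idea:

    // Illustrative thread cap per picture: lower temporal levels are
    // referenced by more pictures, so they receive more WPP threads.
    int ntMaxForPicture(int temporalLevel) {
        switch (temporalLevel) {
            case 0:  return 6;  // first temporal level
            case 1:  return 4;  // second temporal level
            default: return 2;  // third and higher levels (e.g., TL-3)
        }
    }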

Multiple (e.g., two) thread pools may be used. A first thread pool may be used to encode reference pictures while a second thread pool may be used to encode non-reference pictures. Scheduling of threads from a pool to the encoding of reference pictures and non-reference pictures, respectively, may be carried out by a scheduling algorithm (e.g., the example algorithms discussed herein).

Thread management may be applied, for example, to multi-GOP parallel encoding. WPP encoding threads may be allocated to more than one GOP. Threads may be allocated to multiple GOPs, for example, when there are available computation resources, such as in multi-core platforms.

For example, a first GOP may include pictures associated with different temporal levels, such as pictures in a first temporal level and pictures in a second temporal level. A second GOP may likewise include pictures associated with different temporal levels, such as pictures in a first temporal level and pictures in a second temporal level. The second-temporal level pictures may be, e.g., non-reference pictures, and the first-temporal level pictures may be, e.g., reference pictures. When the first-temporal level pictures of the first GOP have finished being coded, the available threads may be assigned to code the first-temporal level pictures of the second GOP. The first-temporal level pictures of the second GOP and the second-temporal level pictures of the first GOP may be coded in parallel using their respective allocated parallel processing threads.

Threads may be managed for parallel encoding of multi-GOP pictures. In an example encoder/decoder configuration, TL-3 pictures may not be used as reference pictures. TL-0/1/2 picture encoding may not depend on TL-3 pictures in a previous GOP. TL-0/1/2 picture encoding in GOP-1 may be scheduled and completed. Encoding of TL-3 pictures (POC 1, 3, 5, and 7) in GOP-1 may start along with encoding of TL-0/1/2 pictures (POC 16, 12, 10 and 14) in GOP-2.
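
Sketching that handoff with the illustrative GopState and the scheduleByPriority() sketch above; the preference order here (next GOP's TL-0/1/2 work first) is one plausible choice mirroring the reference-first priority used within a GOP, not a rule stated in the source:

    // Multi-GOP scheduling sketch: a freed WPP thread may serve either the
    // previous GOP's remaining TL-3 pictures or the next GOP's TL-0/1/2
    // pictures, which the text describes as encodable in parallel.
    struct GopPick { GopState* gop; int picIndex; };

    GopPick scheduleAcrossGops(GopState& prevGop, GopState& nextGop) {
        int pick = scheduleByPriority(nextGop);   // TL-0/1/2 of GOP n first
        if (pick >= 0) return {&nextGop, pick};
        pick = scheduleByPriority(prevGop);       // then TL-3 of GOP n-1
        if (pick >= 0) return {&prevGop, pick};
        return {nullptr, -1};                     // thread returns to the pool
    }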

FIG. 22 is a diagram of an example of a multi-GOP encoding architecture. Second temporal level (e.g., TL-3) pictures of a first GOP (e.g., GOP-(n−1)) (e.g., POC 1, 3, 5, 7) and first temporal level (e.g., non-TL-3) pictures of a second GOP (e.g., GOP-n) (e.g., POC 16, 12, 10, 14) may be encoded in parallel.

A multi-GOP encoding framework may use IPWPP to encode TL-0/1/2 pictures belonging to a next GOP. A multi-GOP encoding framework may use a multi-threaded encoding framework to encode TL-3 pictures belonging to a current GOP. A multi-GOP encoding framework may be useful, for example, for computers with more memory and a higher number of CPU cores. Intermediate encoding data for two GOPs of pictures (for example, if the GOP size is 8, then 16 pictures) may be maintained in a multi-GOP encoding framework.

TL-3 pictures of the current GOP and TL-0/1/2 pictures of the next GOP may be encoded sequentially. TL-3 pictures of the current GOP and TL-0/1/2 pictures of the next GOP may be encoded in parallel. A picture in the two sets of pictures may utilize an available number of WPP threads (e.g., the minimum of the number of available CPU cores and the number of CTU rows in a slice) for a compression process, along with an efficient thread scheduling mechanism as described herein.

TL-3 pictures of GOP-(n−1) and non-TL-3 pictures of GOP-n may be encoded in parallel. Encoding may be done asynchronously. TL-3 encoding of GOP-(n−1) may be faster than non-TL-3 encoding of GOP-n. Free threads, e.g., threads released from TL-3 encoding, may be allocated to non-TL-3 pictures in GOP-n to accelerate GOP-n non-TL-3 picture encoding.

A thread scheduling algorithm, e.g., as shown in FIGS. 26A, 26B, and a thread migration algorithm, e.g., as shown in FIGS. 27A, 27B, may be extended, for example, by increasing the range of picture checking for thread assignment. This may be used, for example, in multi-GOP parallel encoding. This technique may not differentiate which GOP a picture belongs to. There may be one thread pool from a thread management point of view.

A thread pool may be separated into multiple pools, for example, based on the number of GOPs that are being coded in parallel. For example, there may be two thread pools for parallel encoding of two GOPs. A thread pool may be allocated to coding one GOP, for example, to avoid interference among GOPs. With multiple thread pools, GOP encoding may be in order. Some applications, such as live video broadcasting, may have a delay requirement, which may favor one technique over another. In examples, the parallel processing threads may be allocated to the first GOP and/or the second GOP based on a priority.

FIG. 23 shows an example of average encoding time per frame for different bit streams using different examples of HEVC optimization techniques. Bit stream examples are BasketballDrive (BB), ParkScene (PS), Kimono and Cactus. Simulations were performed on a desktop computer with an Intel Core i7 2600 processor having four cores running at 3.4 GHz. Five optimization results are shown: (1) the HM reference encoder; (2) SIMD optimization; (3) SIMD optimization with WPP optimization; (4) multi-threaded processing for TL-3 pictures plus SIMD optimization with WPP optimization; and (5) multi-threaded processing for TL-0/1/2 pictures plus multi-threaded processing for TL-3 pictures and SIMD optimization with WPP optimization.

FIG. 24 shows an example of Rate-Distortion (RD) curves for an optimized HEVC encoder compared to an X265 encoder (v 1.0) for a ParkScene bit stream.

FIG. 25 shows an example of Rate-Distortion (RD) curves for an optimized HEVC encoder compared to an X265 encoder (v 1.0) for a BasketballDrive bit stream.

FIG. 28A is a diagram of an example communications system 100 in which one or more disclosed embodiments may be implemented. The communications system 100 may be a multiple access system that provides content, such as voice, data, video, messaging, broadcast, etc., to multiple wireless users. The communications system 100 may enable multiple wireless users to access such content through the sharing of system resources, including wireless bandwidth. For example, the communications systems 100 may employ one or more channel access techniques, such as code division multiple access (CDMA), time division multiple access (TDMA), frequency division multiple access (FDMA), orthogonal FDMA (OFDMA), single-carrier FDMA (SC-FDMA), and the like.

As shown in FIG. 28A, the communications system 100 may include wireless transmit/receive units (WTRUs) 102 a, 102 b, 102 c, and/or 102 d (which generally or collectively may be referred to as WTRU 102), a radio access network (RAN) 103/104/105, a core network 106/107/109, a public switched telephone network (PSTN) 108, the Internet 110, and other networks 112, though it will be appreciated that the disclosed embodiments contemplate any number of WTRUs, base stations, networks, and/or network elements. WTRUs 102 a, 102 b, 102 c, 102 d may be any type of device configured to operate and/or communicate in a wireless environment. By way of example, the WTRUs 102 a, 102 b, 102 c, 102 d may be configured to transmit and/or receive wireless signals and may include user equipment (UE), a mobile station, a fixed or mobile subscriber unit, a pager, a cellular telephone, a personal digital assistant (PDA), a smartphone, a laptop, a netbook, a personal computer, a wireless sensor, consumer electronics, and the like.

The communications systems 100 may also include a base station 114 a and a base station 114 b. Base stations 114 a, 114 b may be any type of device configured to wirelessly interface with at least one of the WTRUs 102 a, 102 b, 102 c, 102 d to facilitate access to one or more communication networks, such as the core network 106/107/109, the Internet 110, and/or the networks 112. By way of example, the base stations 114 a, 114 b may be a base transceiver station (BTS), a Node-B, an eNode B, a Home Node B, a Home eNode B, a site controller, an access point (AP), a wireless router, and the like. While base stations 114 a, 114 b are each depicted as a single element, it will be appreciated that the base stations 114 a, 114 b may include any number of interconnected base stations and/or network elements.

The base station 114 a may be part of the RAN 103/104/105, which may also include other base stations and/or network elements (not shown), such as a base station controller (BSC), a radio network controller (RNC), relay nodes, etc. The base station 114 a and/or the base station 114 b may be configured to transmit and/or receive wireless signals within a particular geographic region, which may be referred to as a cell (not shown). The cell may further be divided into cell sectors. For example, the cell associated with the base station 114 a may be divided into three sectors. Thus, in one embodiment, the base station 114 a may include three transceivers, e.g., one for each sector of the cell. In another embodiment, the base station 114 a may employ multiple-input multiple output (MIMO) technology and, therefore, may utilize multiple transceivers for each sector of the cell.

The base stations 114 a, 114 b may communicate with one or more of the WTRUs 102 a, 102 b, 102 c, 102 d over an air interface 115/116/117, which may be any suitable wireless communication link (e.g., radio frequency (RF), microwave, infrared (IR), ultraviolet (UV), visible light, etc.). The air interface 115/116/117 may be established using any suitable radio access technology (RAT).

More specifically, as noted above, the communications system 100 may be a multiple access system and may employ one or more channel access schemes, such as CDMA, TDMA, FDMA, OFDMA, SC-FDMA, and the like. For example, the base station 114 a in the RAN 103/104/105 and the WTRUs 102 a, 102 b, 102 c may implement a radio technology such as Universal Mobile Telecommunications System (UMTS) Terrestrial Radio Access (UTRA), which may establish the air interface 115/116/117 using wideband CDMA (WCDMA). WCDMA may include communication protocols such as High-Speed Packet Access (HSPA) and/or Evolved HSPA (HSPA+). HSPA may include High-Speed Downlink Packet Access (HSDPA) and/or High-Speed Uplink Packet Access (HSUPA).

In another embodiment, the base station 114 a and the WTRUs 102 a, 102 b, 102 c may implement a radio technology such as Evolved UMTS Terrestrial Radio Access (E-UTRA), which may establish the air interface 115/116/117 using Long Term Evolution (LTE) and/or LTE-Advanced (LTE-A).

In other embodiments, the base station 114 a and the WTRUs 102 a, 102 b, 102 c may implement radio technologies such as IEEE 802.16 (e.g., Worldwide Interoperability for Microwave Access (WiMAX)), CDMA2000, CDMA2000 1X, CDMA2000 EV-DO, Interim Standard 2000 (IS-2000), Interim Standard 95 (IS-95), Interim Standard 856 (IS-856), Global System for Mobile communications (GSM), Enhanced Data rates for GSM Evolution (EDGE), GSM EDGE (GERAN), and the like.

The base station 114 b in FIG. 28A may be a wireless router, Home Node B, Home eNode B, or access point, for example, and may utilize any suitable RAT for facilitating wireless connectivity in a localized area, such as a place of business, a home, a vehicle, a campus, and the like. In one embodiment, the base station 114 b and the WTRUs 102 c, 102 d may implement a radio technology such as IEEE 802.11 to establish a wireless local area network (WLAN). In another embodiment, the base station 114 b and the WTRUs 102 c, 102 d may implement a radio technology such as IEEE 802.15 to establish a wireless personal area network (WPAN). In yet another embodiment, the base station 114 b and the WTRUs 102 c, 102 d may utilize a cellular-based RAT (e.g., WCDMA, CDMA2000, GSM, LTE, LTE-A, etc.) to establish a picocell or femtocell. As shown in FIG. 28A, the base station 114 b may have a direct connection to the Internet 110. Thus, the base station 114 b may not be required to access the Internet 110 via the core network 106/107/109.

The RAN 103/104/105 may be in communication with the core network 106/107/109, which may be any type of network configured to provide voice, data, applications, and/or voice over internet protocol (VoIP) services to one or more of the WTRUs 102 a, 102 b, 102 c, 102 d. For example, the core network 106/107/109 may provide call control, billing services, mobile location-based services, pre-paid calling, Internet connectivity, video distribution, etc., and/or perform high-level security functions, such as user authentication. Although not shown in FIG. 28A, it will be appreciated that the RAN 103/104/105 and/or the core network 106/107/109 may be in direct or indirect communication with other RANs that employ the same RAT as the RAN 103/104/105 or a different RAT. For example, in addition to being connected to the RAN 103/104/105, which may be utilizing an E-UTRA radio technology, the core network 106/107/109 may also be in communication with another RAN (not shown) employing a GSM radio technology.

The core network 106/107/109 may also serve as a gateway for the WTRUs 102 a, 102 b, 102 c, 102 d to access the PSTN 108, the Internet 110, and/or other networks 112. The PSTN 108 may include circuit-switched telephone networks that provide plain old telephone service (POTS). The Internet 110 may include a global system of interconnected computer networks and devices that use common communication protocols, such as the transmission control protocol (TCP), user datagram protocol (UDP) and the internet protocol (IP) in the TCP/IP internet protocol suite. The networks 112 may include wired or wireless communications networks owned and/or operated by other service providers. For example, the networks 112 may include another core network connected to one or more RANs, which may employ the same RAT as the RAN 103/104/105 or a different RAT.

One or more of the WTRUs 102 a, 102 b, 102 c, 102 d in the communications system 100 may include multi-mode capabilities, e.g., the WTRUs 102 a, 102 b, 102 c, 102 d may include multiple transceivers for communicating with different wireless networks over different wireless links. For example, the WTRU 102 c shown in FIG. 28A may be configured to communicate with the base station 114 a, which may employ a cellular-based radio technology, and with the base station 114 b, which may employ an IEEE 802 radio technology.

FIG. 28B is a system diagram of an example WTRU 102. As shown in FIG. 28B, the WTRU 102 may include a processor 118, a transceiver 120, a transmit/receive element 122, a speaker/microphone 124, a keypad 126, a display/touchpad 128, non-removable memory 130, removable memory 132, a power source 134, a global positioning system (GPS) chipset 136, and other peripherals 138. It will be appreciated that the WTRU 102 may include any sub-combination of the foregoing elements while remaining consistent with an embodiment. Also, embodiments contemplate that the base stations 114 a and 114 b, and/or the nodes that base stations 114 a and 114 b may represent, such as but not limited to a base transceiver station (BTS), a Node-B, a site controller, an access point (AP), a home node-B, an evolved home node-B (eNodeB), a home evolved node-B (HeNB), a home evolved node-B gateway, and proxy nodes, among others, may include one or more of the elements depicted in FIG. 28B and described herein.

The processor 118 may be a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), a state machine, and the like. The processor 118 may perform signal coding, data processing, power control, input/output processing, and/or any other functionality that enables the WTRU 102 to operate in a wireless environment. The processor 118 may be coupled to the transceiver 120, which may be coupled to the transmit/receive element 122. While FIG. 28B depicts the processor 118 and the transceiver 120 as separate components, it will be appreciated that the processor 118 and the transceiver 120 may be integrated together in an electronic package or chip.

The transmit/receive element 122 may be configured to transmit signals to, or receive signals from, a base station (e.g., the base station 114 a) over the air interface 115/116/117. For example, in one embodiment, the transmit/receive element 122 may be an antenna configured to transmit and/or receive RF signals. In another embodiment, the transmit/receive element 122 may be an emitter/detector configured to transmit and/or receive IR, UV, or visible light signals, for example. In yet another embodiment, the transmit/receive element 122 may be configured to transmit and receive both RF and light signals. It will be appreciated that the transmit/receive element 122 may be configured to transmit and/or receive any combination of wireless signals.

In addition, although the transmit/receive element 122 is depicted in FIG. 28B as a single element, the WTRU 102 may include any number of transmit/receive elements 122. More specifically, the WTRU 102 may employ MIMO technology. Thus, in one embodiment, the WTRU 102 may include two or more transmit/receive elements 122 (e.g., multiple antennas) for transmitting and receiving wireless signals over the air interface 115/116/117.

The transceiver 120 may be configured to modulate the signals that are to be transmitted by the transmit/receive element 122 and to demodulate the signals that are received by the transmit/receive element 122. As noted above, the WTRU 102 may have multi-mode capabilities. Thus, the transceiver 120 may include multiple transceivers for enabling the WTRU 102 to communicate via multiple RATs, such as UTRA and IEEE 802.11, for example.

The processor 118 of the WTRU 102 may be coupled to, and may receive user input data from, the speaker/microphone 124, the keypad 126, and/or the display/touchpad 128 (e.g., a liquid crystal display (LCD) display unit or organic light-emitting diode (OLED) display unit). The processor 118 may also output user data to the speaker/microphone 124, the keypad 126, and/or the display/touchpad 128. In addition, the processor 118 may access information from, and store data in, any type of suitable memory, such as the non-removable memory 130 and/or the removable memory 132. The non-removable memory 130 may include random-access memory (RAM), read-only memory (ROM), a hard disk, or any other type of memory storage device. The removable memory 132 may include a subscriber identity module (SIM) card, a memory stick, a secure digital (SD) memory card, and the like. In other embodiments, the processor 118 may access information from, and store data in, memory that is not physically located on the WTRU 102, such as on a server or a home computer (not shown).

The processor 118 may receive power from the power source 134, and may be configured to distribute and/or control the power to the other components in the WTRU 102. The power source 134 may be any suitable device for powering the WTRU 102. For example, the power source 134 may include one or more dry cell batteries (e.g., nickel-cadmium (NiCd), nickel-zinc (NiZn), nickel metal hydride (NiMH), lithium-ion (Li-ion), etc.), solar cells, fuel cells, and the like.

The processor 118 may also be coupled to the GPS chipset 136, which may be configured to provide location information (e.g., longitude and latitude) regarding the current location of the WTRU 102. In addition to, or in lieu of, the information from the GPS chipset 136, the WTRU 102 may receive location information over the air interface 115/116/117 from a base station (e.g., base stations 114 a, 114 b) and/or determine its location based on the timing of the signals being received from two or more nearby base stations. It will be appreciated that the WTRU 102 may acquire location information by way of any suitable location-determination technique while remaining consistent with an embodiment.

The processor 118 may further be coupled to other peripherals 138, which may include one or more software and/or hardware modules that provide additional features, functionality and/or wired or wireless connectivity. For example, the peripherals 138 may include an accelerometer, an e-compass, a satellite transceiver, a digital camera (for photographs or video), a universal serial bus (USB) port, a vibration device, a television transceiver, a hands free headset, a Bluetooth® module, a frequency modulated (FM) radio unit, a digital music player, a media player, a video game player module, an Internet browser, and the like.

FIG. 28C is a system diagram of the RAN 103 and the core network 106 according to an embodiment. As noted above, the RAN 103 may employ a UTRA radio technology to communicate with the WTRUs 102 a, 102 b, 102 c over the air interface 115. The RAN 103 may also be in communication with the core network 106. As shown in FIG. 28C, the RAN 103 may include Node-Bs 140 a, 140 b, 140 c, which may each include one or more transceivers for communicating with the WTRUs 102 a, 102 b, 102 c over the air interface 115. The Node-Bs 140 a, 140 b, 140 c may each be associated with a particular cell (not shown) within the RAN 103. The RAN 103 may also include RNCs 142 a, 142 b. It will be appreciated that the RAN 103 may include any number of Node-Bs and RNCs while remaining consistent with an embodiment.

As shown in FIG. 28C, the Node-Bs 140 a, 140 b may be in communication with the RNC 142 a. Additionally, the Node-B 140 c may be in communication with the RNC 142 b. The Node-Bs 140 a, 140 b, 140 c may communicate with the respective RNCs 142 a, 142 b via an Iub interface. The RNCs 142 a, 142 b may be in communication with one another via an Iur interface. Each of the RNCs 142 a, 142 b may be configured to control the respective Node-Bs 140 a, 140 b, 140 c to which it is connected. In addition, each of the RNCs 142 a, 142 b may be configured to carry out or support other functionality, such as outer loop power control, load control, admission control, packet scheduling, handover control, macrodiversity, security functions, data encryption, and the like.

The core network 106 shown in FIG. 28C may include a media gateway (MGW) 144, a mobile switching center (MSC) 146, a serving GPRS support node (SGSN) 148, and/or a gateway GPRS support node (GGSN) 150. While each of the foregoing elements are depicted as part of the core network 106, it will be appreciated that any one of these elements may be owned and/or operated by an entity other than the core network operator.

The RNC 142 a in the RAN 103 may be connected to the MSC 146 in the core network 106 via an IuCS interface. The MSC 146 may be connected to the MGW 144. The MSC 146 and the MGW 144 may provide the WTRUs 102 a, 102 b, 102 c with access to circuit-switched networks, such as the PSTN 108, to facilitate communications between the WTRUs 102 a, 102 b, 102 c and land-line communications devices.

The RNC 142 a in the RAN 103 may also be connected to the SGSN 148 in the core network 106 via an IuPS interface. The SGSN 148 may be connected to the GGSN 150. The SGSN 148 and the GGSN 150 may provide the WTRUs 102 a, 102 b, 102 c with access to packet-switched networks, such as the Internet 110, to facilitate communications between the WTRUs 102 a, 102 b, 102 c and IP-enabled devices.

As noted above, the core network 106 may also be connected to the networks 112, which may include other wired or wireless networks that are owned and/or operated by other service providers.

FIG. 28D is a system diagram of the RAN 104 and the core network 107 according to an embodiment. As noted above, the RAN 104 may employ an E-UTRA radio technology to communicate with the WTRUs 102 a, 102 b, 102 c over the air interface 116. The RAN 104 may also be in communication with the core network 107.

The RAN 104 may include eNode-Bs 160 a, 160 b, 160 c, though it will be appreciated that the RAN 104 may include any number of eNode-Bs while remaining consistent with an embodiment. The eNode-Bs 160 a, 160 b, 160 c may each include one or more transceivers for communicating with the WTRUs 102 a, 102 b, 102 c over the air interface 116. In one embodiment, the eNode-Bs 160 a, 160 b, 160 c may implement MIMO technology. Thus, the eNode-B 160 a, for example, may use multiple antennas to transmit wireless signals to, and receive wireless signals from, the WTRU 102 a.

Each of the eNode-Bs 160 a, 160 b, 160 c may be associated with a particular cell (not shown) and may be configured to handle radio resource management decisions, handover decisions, scheduling of users in the uplink and/or downlink, and the like. As shown in FIG. 28D, the eNode-Bs 160 a, 160 b, 160 c may communicate with one another over an X2 interface.

The core network 107 shown in FIG. 28D may include a mobility management gateway (MME) 162, a serving gateway 164, and a packet data network (PDN) gateway 166. While each of the foregoing elements are depicted as part of the core network 107, it will be appreciated that any one of these elements may be owned and/or operated by an entity other than the core network operator.

The MME 162 may be connected to each of the eNode-Bs 160 a, 160 b, 160 c in the RAN 104 via an S1 interface and may serve as a control node. For example, the MME 162 may be responsible for authenticating users of the WTRUs 102 a, 102 b, 102 c, bearer activation/deactivation, selecting a particular serving gateway during an initial attach of the WTRUs 102 a, 102 b, 102 c, and the like. The MME 162 may also provide a control plane function for switching between the RAN 104 and other RANs (not shown) that employ other radio technologies, such as GSM or WCDMA.

The serving gateway 164 may be connected to each of the eNode-Bs 160 a, 160 b, 160 c in the RAN 104 via the S1 interface. The serving gateway 164 may generally route and forward user data packets to/from the WTRUs 102 a, 102 b, 102 c. The serving gateway 164 may also perform other functions, such as anchoring user planes during inter-eNode B handovers, triggering paging when downlink data is available for the WTRUs 102 a, 102 b, 102 c, managing and storing contexts of the WTRUs 102 a, 102 b, 102 c, and the like.

The serving gateway 164 may also be connected to the PDN gateway 166, which may provide the WTRUs 102 a, 102 b, 102 c with access to packet-switched networks, such as the Internet 110, to facilitate communications between the WTRUs 102 a, 102 b, 102 c and IP-enabled devices.

The core network 107 may facilitate communications with other networks. For example, the core network 107 may provide the WTRUs 102 a, 102 b, 102 c with access to circuit-switched networks, such as the PSTN 108, to facilitate communications between the WTRUs 102 a, 102 b, 102 c and land-line communications devices. For example, the core network 107 may include, or may communicate with, an IP gateway (e.g., an IP multimedia subsystem (IMS) server) that serves as an interface between the core network 107 and the PSTN 108. In addition, the core network 107 may provide the WTRUs 102 a, 102 b, 102 c with access to the networks 112, which may include other wired or wireless networks that are owned and/or operated by other service providers.

FIG. 28E is a system diagram of the RAN 105 and the core network 109 according to an embodiment. The RAN 105 may be an access service network (ASN) that employs IEEE 802.16 radio technology to communicate with the WTRUs 102 a, 102 b, 102 c over the air interface 117. As will be further discussed below, the communication links between the different functional entities of the WTRUs 102 a, 102 b, 102 c, the RAN 105, and the core network 109 may be defined as reference points.

As shown in FIG. 28E, the RAN 105 may include base stations 180 a, 180 b, 180 c, and an ASN gateway 182, though it will be appreciated that the RAN 105 may include any number of base stations and ASN gateways while remaining consistent with an embodiment. The base stations 180 a, 180 b, 180 c may each be associated with a particular cell (not shown) in the RAN 105 and may each include one or more transceivers for communicating with the WTRUs 102 a, 102 b, 102 c over the air interface 117. In one embodiment, the base stations 180 a, 180 b, 180 c may implement MIMO technology. Thus, the base station 180 a, for example, may use multiple antennas to transmit wireless signals to, and receive wireless signals from, the WTRU 102 a. The base stations 180 a, 180 b, 180 c may also provide mobility management functions, such as handoff triggering, tunnel establishment, radio resource management, traffic classification, quality of service (QoS) policy enforcement, and the like. The ASN gateway 182 may serve as a traffic aggregation point and may be responsible for paging, caching of subscriber profiles, routing to the core network 109, and the like.

The air interface 117 between the WTRUs 102 a, 102 b, 102 c and the RAN 105 may be defined as an R1 reference point that implements the IEEE 802.16 specification. In addition, each of the WTRUs 102 a, 102 b, 102 c may establish a logical interface (not shown) with the core network 109. The logical interface between the WTRUs 102 a, 102 b, 102 c and the core network 109 may be defined as an R2 reference point, which may be used for authentication, authorization, IP host configuration management, and/or mobility management.

The communication link between each of the base stations 180 a, 180 b, 180 c may be defined as an R8 reference point that includes protocols for facilitating WTRU handovers and the transfer of data between base stations. The communication link between the base stations 180 a, 180 b, 180 c and the ASN gateway 182 may be defined as an R6 reference point. The R6 reference point may include protocols for facilitating mobility management based on mobility events associated with each of the WTRUs 102 a, 102 b, 102 c.

As shown in FIG. 28E, the RAN 105 may be connected to the core network 109. The communication link between the RAN 105 and the core network 109 may be defined as an R3 reference point that includes protocols for facilitating data transfer and mobility management capabilities, for example. The core network 109 may include a mobile IP home agent (MIP-HA) 184, an authentication, authorization, accounting (AAA) server 186, and a gateway 188. While each of the foregoing elements are depicted as part of the core network 109, it will be appreciated that any one of these elements may be owned and/or operated by an entity other than the core network operator.

The MIP-HA 184 may be responsible for IP address management, and may enable the WTRUs 102 a, 102 b, 102 c to roam between different ASNs and/or different core networks. The MIP-HA 184 may provide the WTRUs 102 a, 102 b, 102 c with access to packet-switched networks, such as the Internet 110, to facilitate communications between the WTRUs 102 a, 102 b, 102 c and IP-enabled devices. The AAA server 186 may be responsible for user authentication and for supporting user services. The gateway 188 may facilitate interworking with other networks. For example, the gateway 188 may provide the WTRUs 102 a, 102 b, 102 c with access to circuit-switched networks, such as the PSTN 108, to facilitate communications between the WTRUs 102 a, 102 b, 102 c and land-line communications devices. In addition, the gateway 188 may provide the WTRUs 102 a, 102 b, 102 c with access to the networks 112, which may include other wired or wireless networks that are owned and/or operated by other service providers.

Although not shown in FIG. 28E, it will be appreciated that the RAN 105 may be connected to other ASNs and the core network 109 may be connected to other core networks. The communication link between the RAN 105 and the other ASNs may be defined as an R4 reference point, which may include protocols for coordinating the mobility of the WTRUs 102 a, 102 b, 102 c between the RAN 105 and the other ASNs. The communication link between the core network 109 and the other core networks may be defined as an R5 reference point, which may include protocols for facilitating interworking between home core networks and visited core networks.

Although features and elements are described above in particular combinations, one of ordinary skill in the art will appreciate that each feature or element can be used alone or in any combination with the other features and elements. In addition, the techniques described herein may be implemented in a computer program, software, or firmware incorporated in a computer-readable medium for execution by a computer or processor. Examples of computer-readable media include electronic signals (transmitted over wired or wireless connections) and computer-readable storage media. Examples of computer-readable storage media include, but are not limited to, a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs). A processor in association with software may be used to implement a radio frequency transceiver for use in a WTRU, UE, terminal, base station, RNC, or any host computer.

1-31. (canceled)
32. A device comprising: a processor configured to: receive a video sequence comprising a first group of pictures (GOP) and a second GOP; identify a first plurality of first-GOP pictures and a second plurality of second-GOP pictures for parallel encoding based on their respective temporal levels, wherein the first plurality of first-GOP pictures are associated with a first temporal level and the second plurality of second-GOP pictures are associated with a second temporal level; allocate one or more parallel processing threads for encoding the first plurality of first-GOP pictures and the second plurality of second-GOP pictures; and encode, in parallel, the identified first plurality of first-GOP pictures and second plurality of second-GOP pictures.
33. The device of claim 32, wherein when the first plurality of first-GOP pictures have finished being encoded, the processor is further configured to encode, in parallel, the second plurality of second-GOP pictures associated with the second temporal level.
34. The device of claim 32, wherein pictures associated with the first temporal level are non-reference pictures, and pictures associated with the second temporal level are reference pictures for temporal prediction.
35. The device of claim 32, wherein the video sequence further comprises a third GOP, and wherein when the first plurality of first-GOP pictures have finished being encoded, the processor is configured to: identify a third plurality of third-GOP pictures associated with the second temporal level; and encode, in parallel, the second plurality of second-GOP pictures and the third plurality of third-GOP pictures associated with the second temporal level using the parallel processing threads.
36. The device of claim 35, wherein the third plurality of third-GOP pictures do not refer to the second plurality of second-GOP pictures.
37. The device of claim 32, wherein the video sequence further comprises a third GOP, and the processor is further configured to: identify a third plurality of third-GOP pictures associated with the second temporal level and a fourth plurality of second-GOP pictures associated with the first temporal level; and encode, in parallel, the fourth plurality of second-GOP pictures associated with the first temporal level and the third plurality of third-GOP pictures associated with the second temporal level using the parallel processing threads.
38. The device of claim 32, wherein the first temporal level is higher than the second temporal level.
39. The device of claim 32, wherein the first temporal level is temporal level 3, and the second temporal level comprises at least one of temporal level 0, temporal level 1, or temporal level 2.
40. The device of claim 32, wherein the second plurality of second-GOP pictures do not refer to the first plurality of first-GOP pictures.
41. A method comprising: receiving a video sequence comprising a first group of pictures (GOP) and a second GOP; identifying a first plurality of first-GOP pictures and a second plurality of second-GOP pictures for parallel encoding based on their respective temporal levels, wherein the first plurality of first-GOP pictures are associated with a first temporal level and the second plurality of second-GOP pictures are associated with a second temporal level; allocating one or more parallel processing threads for encoding the first plurality of first-GOP pictures and the second plurality of second-GOP pictures; and encoding, in parallel, the identified first plurality of first-GOP pictures and second plurality of second-GOP pictures.
42. The method of claim 41, further comprising: upon completion of encoding the first plurality of first-GOP pictures, encoding, in parallel, the second plurality of second-GOP pictures associated with the second temporal level.
43. The method of claim 41, wherein pictures associated with the first temporal level are non-reference pictures, and pictures associated with the second temporal level are reference pictures for temporal prediction.
44. The method of claim 41, wherein the video sequence further comprises a third GOP, the method further comprising: upon completion of encoding the first plurality of first-GOP pictures, identifying a third plurality of third-GOP pictures associated with the second temporal level; and encoding, in parallel, the second plurality of second-GOP pictures and the third plurality of third-GOP pictures associated with the second temporal level using the parallel processing threads.
45. The method of claim 44, wherein the third plurality of third-GOP pictures do not refer to the second plurality of second-GOP pictures.
46. The method of claim 41, wherein the video sequence further comprises a third GOP, the method further comprising: identifying a third plurality of third-GOP pictures associated with the second temporal level and a fourth plurality of second-GOP pictures associated with the first temporal level; and encoding, in parallel, the fourth plurality of second-GOP pictures associated with the first temporal level and the third plurality of third-GOP pictures associated with the second temporal level using the parallel processing threads.
47. The method of claim 41, wherein the first temporal level is higher than the second temporal level.
48. The method of claim 41, wherein the first temporal level is temporal level 3, and the second temporal level comprises at least one of temporal level 0, temporal level 1, or temporal level 2.
49. The method of claim 41, wherein the second plurality of second-GOP pictures do not refer to the first plurality of first-GOP pictures.